This is pretty crazy. I am looking forward to downloading the AI app that lets me diagnose patients via uploading a photo. Poor dermatologists/radiologists/pathologists
At least in the US regulatory context, I'm not sure that's relevant. There are many examples of systems working better than humans at diagnosis or some other task, but the human still being required for legal, insurance, or whatever other reasons that make performance strictly worse.
I think that roles and laws will change to make room for AI. Radiologists and the like will still have roles, but they will have to be fluent with AI and able to work with several at a time, overseeing training, quality control, etc. And the laws will have to change. I'm not sure how they will be or ought to be changed; I'm just convinced that AI offers such a great increment in quality and quantity of work done (and reduced cost? -- I think so) that its gravitational pull for organizations will be immense, and they will be pulled into using it extensively. Lots of other stuff will have to change to make room for AI -- laws, insurance, human job roles, human skill training, patient & customer expectations, record keeping, taxes, hiring . . .
Of course, completely agreed. But timing matters, too. It's entirely possible the laws will take a decade or three to catch up and we'll all be getting subpar care in the interim. It would hardly be the first time.
Yeah, I don't know enough about laws and the other matters I named to reason about how things might play out there. It does seem that AI has so much going for it (or at least appears to) that the changeover to lots of AI interwoven into professions is going to be pretty fast. Maybe sort of like the changes brought about by cars, or WW2 (women in factories), or birth control pills. But the era in which laws, people's expectations, etc. change to fit the new realities may be prolonged and chaotic.
I think it is getting significantly better in the new models like o3 and Sonnet 3.7. These models have reasoning overlays and use self-talk to self-critique their responses before answering, and it seems to make a very large difference. There aren't any formal studies out yet, AFAICT, but that's not surprising since the models are only a few months old. This field is moving so fast that research always lags. I'd expect to see preprints with the data you're looking for in a month or two.
I remember reading a story >10 years ago (well before ChatGPT) about how AI was already better than humans at some medical diagnoses, and in fact AI was better than "human with access to AI" because the human overrode the AI incorrectly more often than correctly.
I haven't checked, but I wouldn't be remotely surprised if modern AI is already better than most human doctors at most diagnostics that rely only on images and patient interviews (I'd guess doctors are probably still better if they have to physically poke at you to make the diagnosis).
"Using knee X-rays from the National Institutes of Health-funded Osteoarthritis Initiative, researchers demonstrated that AI models could “predict” unrelated and implausible traits, such as whether patients abstained from eating refried beans or drinking beer. While these predictions have no medical basis, the models achieved surprising levels of accuracy, revealing their ability to exploit subtle and unintended patterns in the data.
“While AI has the potential to transform medical imaging, we must be cautious,” said Peter L. Schilling, MD, MS, an orthopaedic surgeon at Dartmouth Health’s Dartmouth Hitchcock Medical Center (DHMC), who served as senior author on the study. “These models can see patterns humans cannot, but not all patterns they identify are meaningful or reliable. It’s crucial to recognize these risks to prevent misleading conclusions and ensure scientific integrity.”
Schilling and his colleagues examined how AI algorithms often rely on confounding variables—such as differences in X-ray equipment or clinical site markers—to make predictions rather than medically meaningful features. Attempts to eliminate these biases were only marginally successful—the AI models would just “learn” other hidden data patterns.
The research team’s findings underscore the need for rigorous evaluation standards in AI-based medical research. Over-reliance on standard algorithms without deeper scrutiny could lead to erroneous clinical insights and treatment pathways. “This goes beyond bias from clues of race or gender,” said Brandon G. Hill, a machine learning scientist at DHMC and one of Schilling’s co-authors. “We found the algorithm could even learn to predict the year an X-ray was taken. It’s pernicious; when you prevent it from learning one of these elements, it will instead learn another it previously ignored. This danger can lead to some really dodgy claims, and researchers need to be aware of how readily this happens when using this technique.”
“The burden of proof just goes way up when it comes to using models for the discovery of new patterns in medicine,” Hill continued. “Part of the problem is our own bias. It is incredibly easy to fall into the trap of presuming that the model ‘sees’ the same way we do. In the end, it doesn’t. It is almost like dealing with an alien intelligence. You want to say the model is ‘cheating,’ but that anthropomorphizes the technology. It learned a way to solve the task given to it, but not necessarily how a person would. It doesn’t have logic or reasoning as we typically understand it.”
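To make the failure mode concrete, here is a toy sketch (synthetic data, not from the study): if the label is correlated with which site or machine produced the image, a model will happily key on the site marker instead of any medical signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
site = rng.integers(0, 2, n)                        # which clinic / X-ray machine
label = (site ^ (rng.random(n) < 0.1)).astype(int)  # label agrees with site 90% of the time
medical_signal = rng.normal(size=(n, 5))            # pure noise standing in for real features

X = np.column_stack([medical_signal, site])
clf = LogisticRegression().fit(X, label)
print(f"accuracy: {clf.score(X, label):.2f}")       # ~0.90, all of it from the confounder
print(f"site coefficient: {clf.coef_[0, -1]:.2f}")  # dwarfs the noise-feature weights
```

Drop the site column and accuracy collapses to roughly chance, which is the "it will instead learn another hidden pattern" problem in miniature.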
This is a huge problem in using AI to sort job applications or things like loan applications. AIs will consistently not want to hire black applicants or give them loans. Some would say that this is good sense, while others will see massive illegal discrimination. But because no one actually fully understands what the AI is recognizing in patterns, it's not possible to really tell. It just becomes a dangerous landmine, so AIs working in these areas need to be heavily tweaked to not do this, which arguably destroys the value in having AI process things anyway.
If my job depended* on not giving loans that run into trouble: very bad luck for members of some groups (tattooed - oops, careful if in El Salvador / metal in face / I won't comment here on skin color...). Sure, you'd act differently? - Just as embassies work when denying visa (seen that often): look at passport - "denied". That AI Scott showed would do a much more sophisticated job! Much fairer. And chances to get a loan or a visum would jump from zilch to: quite good. If the zillion other hints fit. Or stay at zero, if they do not.
(*Banks want to do business. The optimum number of loans gone bad is NOT zero. State bureaucracies are often worse.)
In Latin, the noun "visum" is neuter. And "visa" the plural. As it is in German.
In English, "visa" is short for Latin "charta visa"='paper that has been seen' - while "charta" is a feminine noun in Latin, I doubt "I got her" is proper English for "I got the visa".
That said, you are obviously correct about "visum" not being used in English. Auto-correct does sometimes work when commenting, sometimes it doesn't. (When it does not, "can't" ends up looking like: can`t ;))
> In Latin, the noun "visum" is neuter. And "visa" the plural. As it is in German.
Well, in Latin, you could use the word "visum" to mean anything that has been seen, like a hidden picture within a larger image once you've spotted it, but it would generally be regarded as a verb. The noun is absent but implicit. [Though it's not clear what the noun would be. You'd use neuter gender if you wanted to imply the noun "thing", but that won't work if you actually include the noun "thing", since it's feminine. But a feminine substantive implies the noun "woman".] It certainly wouldn't work in the sentence "as embassies work when denying visa". Embassies are not in the business of denying "things".
There is no Latin noun "visum" separate from the genderless verb.
No, but probably a legal one where you're allowed to make "predictions" without a medical license. Maybe to be safe, you dress it up as tarot-card or crystal-ball reading, and then to be safe from laws banning that, you say it's for entertainment and not meant to be taken seriously.
Diagnosis would be "yes this patient has a bad kneecap because of years scrubbing floors". Prediction is "before the x-ray was taken, he drank a can of Bud Light".
Reminds me of what Scott(?) posted a week or two back: if you do enough statistical tests on a chunk of data, something will come up as significant. Similarly, if AI looks for enough correlations, it will find one purely by chance.
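A quick simulation of that point (made-up noise data; assumes NumPy and a recent SciPy where pearsonr returns a result object):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100 "patients", 1000 pure-noise "features", and a pure-noise outcome.
n_patients, n_features = 100, 1000
noise = rng.normal(size=(n_patients, n_features))
outcome = rng.normal(size=n_patients)

# Correlate every feature with the outcome: at p < 0.05 we expect ~50
# "significant" hits even though nothing real is in the data.
p_values = [stats.pearsonr(noise[:, j], outcome).pvalue for j in range(n_features)]
print(sum(p < 0.05 for p in p_values), "of", n_features, "noise features look significant")
```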
This seems like an accurate description of at least some significant part of the problem. It is certainly the reason that doctors tend to look at the obvious medical markers for guidance. One would think that you could create prompts that forced the weight of observations to be tempered by their degree of medical connectedness (or some such).
It will be grossly negligent to do it unassisted given that the loudest signal in the data will be the X-ray machine's age. Let's hope they don't just rubber-stamp it.
Even if it were possible for AI to make a perfect diagnosis, there will be rules to stop it from doing that and to protect dermatologists/radiologists/pathologists from going bankrupt.
You jest, but the lack of software liability is a big problem. Inherently faulty and/or misconfigured/misused software costs tens or hundreds of billions USD each year, and the culpable parties can for the most part just shrug and move on as if bad software engineering was force majeure. Extending this literally irresponsible approach to product safety from software to everything else would be a disaster.
So yes, in the end there really has to be someone to sue, because negligence and mistakes will happen. It will cost money and lives, and there needs to be a deterrent beyond mere market forces. If, say, a car has a severe construction defect in the brakes, you can and should sue them if that causes an accident. But if the defect is in the AI and it causes an accident, we should just move on because...why?
You're right of course, but it's very difficult to apportion blame with software (even more so if it's not open source and can be patched before being offered as evidence). Is it the company that provided the software or the company that used it and made a few minor modifications? It's easy to write software that has to be configured for use, and then you can blame the configurer. A long time ago I remember hearing of some software (maybe early Unix) that shipped with compiler errors that were obvious to fix, but then you took on liability for the software...
We've had a very long running scandal in the UK over the Post Office and Fujitsu's Horizon system.
Most corporate business software was written years ago by programmers who no longer work for the company, so nobody understands it.
I used to play EvE Online, and for a while the developers were able to leverage their uniquely old and geeky playerbase (average age 27, 98% male, apparently almost all in STEM and usually some kind of CS-related field) to partner with a biology research group. They added a minigame in which players analyzed images of cells and indicated which organelles exhibited a green color.
No straight lines, no clear boundaries--exactly the sort of problem with which computers have generally struggled. If we can make a program which reliably does that, though--well, at least from the conversations I hear between my biologist coworkers, that'll be a huge deal.
I worked for a company that did this a long time ago with melanoma detection. It was much more accurate than an average dermatologist. They operated in Europe but never in the USA because the FDA wouldn't give them approval after ten years of trying. Hopefully things have changed now.
I weirdly just tried doing this moments before you published this. It’s eerie but surprisingly useful, and I don’t think anyone predicted LLMs would be good at this.
Is it that hard to predict? LLMs are trained with probably billions of images scraped from the Web. Lots of those have metadata and descriptions that mention their locality. We don't really know how they train the visual components, but I bet that they are teaching it with image-description pairs somehow.
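For what it's worth, the open image-text models do exactly this with a CLIP-style contrastive objective over image-caption pairs; whether OpenAI's vision stack is trained the same way isn't public, but here's a minimal sketch of that objective:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss over a batch of (image, caption) embedding pairs.

    Matching pairs sit on the diagonal of the similarity matrix; training
    pulls them together and pushes mismatched pairs apart.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(len(logits))
    # Symmetric cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Feed that billions of (photo, "sunset over Galway Bay") pairs and location knowledge falls out as a side effect.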
I certainly did not. Sure, it's obvious in hindsight... so I'm looking forward to empiko having predicted this as well as other near-term AI "wow" moments.
I did, but only because I am personally extremely interested in geography and outdoor spaces. I’ve been personally testing Claude on guessing some extremely remote mountainous areas with little public information, and it does well but not spectacularly so. I’ll have to test ChatGPT on this.
At some point, someone seems to have decided that an LLM is a model that's been trained on lots of text, not necessarily *solely* on text. If there's other stuff as well, it's a "multimodal" LLM.
It's not surprising that a model that was explicitly trained on the GeoGuessr task performs well on the GeoGuessr task. OpenAI has a history of training its LLMs to perform well on specific flashy tasks which have been solved before in the general case but never before _specifically within easily publicly accessible chat LLMs_ (see e.g. https://dynomight.net/chess/, where gpt-3.5-turbo-instruct _specifically_ but not any of the other gpt line models is good at chess). Likewise the GeoGuessr task has been solved at a pretty high accuracy level since a couple of years ago (https://huggingface.co/geolocal/StreetCLIP), but until you fine-tune the chatgpt of the month to have a capability, that capability doesn't exist in the minds of the public.
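(Anyone who wants to check that claim can query the linked StreetCLIP model zero-shot through the standard transformers CLIP interface; a sketch, with a placeholder image URL:)

```python
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

# Placeholder URL; substitute any street-level photo.
image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)
countries = ["France", "Japan", "Brazil", "Kenya", "United States"]

inputs = processor(text=countries, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
for country, p in zip(countries, probs[0]):
    print(f"{country}: {p:.1%}")
```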
That is my strong suspicion based on the observation that if you upload a picture of a landscape near a road to o3 with *no comment at all*, it will decide to GeoGuessr a third of the time.
That or the task it's doing is "caption image" and that task just _looks_ like the GeoGuessr task because image captions often contain location information. Considering how often the reasoning chains contain the word "GeoGuessr" though I suspect it was explicitly trained on the task.
I would really like confirmation of this. The accomplishment would still be impressive but it would not give me the willies the way Scott’s presentation does.
Answering "what is this?" seems like a pretty natural response when just given an image with no context. Similarly, if you type in a single word with no other context, it'll give you a definition (assuming you pick a word where it's reasonable to think you might want a definition).
If you upload a picture of something unrelated to GeoGuessr (like a random object or a screenshot from a game), it'll also usually tell you what it is.
o3 is a reasoning model, so it's going to make extra effort to reason and elaborate. And for an image of a public place, the natural elaboration would be for it to tell you not just that "this is a mountain", but to tell you which mountain.
Alright, but if the "guess the location" behavior naturally falls out of o3's "answer what the picture is" behavior, why do the reasoning traces often mention GeoGuessr by name?
My secondary hypothesis for what happened is that OpenAI trained explicitly on the image captioning task, having the location of the image in context helped with that task, and o3 at one point during training reasoned something along the lines of "I should find the location of this image, like they do in GeoGuessr", and scored well on that image captioning task, and so both the behaviors "mention GeoGuessr" and "try to do GeoGuessr" were reinforced.
To distinguish between these hypotheses we'd need to find
1. A set of images that could plausibly come from StreetView, but that, when captioned, definitely would be captioned with something other than the location.
2. Some GeoGuessr-like task with a well-known practitioner, but where the task is obscure enough that OpenAI *definitely* wouldn't have explicitly trained on it, but where being good at the task *would* help with captioning images, and then paste in some images which would regularly have captions written by someone with skills in the task, and then see if the reasoning summaries contain the well-known practitioner's name.
I'm not particularly doubting that OpenAI would train their model on any good data they could get, including GeoGuessr. But it's hard to say. It would know about GeoGuessr from its general training data since GeoGuessr is very popular. So yeah, it could also be something like your second hypothesis.
Is there an available dataset of GeoGuessr games where players give their reasoning, or would they have to extract it from public YouTube videos?
I would've. I have zero stake in my geoguessing ability, and I know the old results of NNs guessing gay vs. straight from images of faces; that generalizes to hands, and for all I know untrained ChatGPT can guess your top 3 fetishes from a list of 100 from pictures of your feet.
I'm not surprised; once they go multimodal, this is inevitable: there are way too many photos online of 'places' with text describing which 'place' either before or after the photo. So... One of the DeepMind results that impressed me the most at the time, back in 2016, was PlaNet, for demonstrating what superhuman knowledge looked like: https://arxiv.org/abs/1602.05314#deepmind . The implications were obvious: CNNs scale up, the Geoguessr AI is completely doable with supervised learning at scale, and what we learned from all the unsupervised AIs like CLIP is that at scale, unsupervised AIs learn what the supervised ones do... And then PIGEON was 2023: https://arxiv.org/abs/2307.05845 (APD is correct: no one reads papers or believes them until they see it available in a free ChatGPT web interface, so if you simply read and believe papers, you will look like you are psychic. "The future is already here, it's just unevenly distributed.")
I am reminded also of the "race from x-rays" paper (https://pubmed.ncbi.nlm.nih.gov/35568690/) back in 2022, which generated a lot of controversy but for (IMHO) the wrong reasons. The truly spooky thing about that paper was its ability to identify the patient's race at better-than-chance rates given only a random 4x4 pixel patch of the x-ray. Setting aside race/biology controversies, that ability is truly beyond human, and it makes me think some of the chain of thought reasoning here re: silt in the water, etc., is mostly made up -- more likely the light in Chiang Sen hits the water in...just such a way...and it puts the pixel patches at...just such a spot on a high-dimensional manifold.
It would be hard to operationalize but I bet o3 could get poor but better-than-chance performance even on truly incomprehensible tasks on a similar tiny-pixel-patch level. You might need to instantiate it as an A/B comparison, e.g. "one of these images is a zoomed in patch of a river in China, the other is from a river in Wisconsin, which is which?" and do ~dozens or hundreds of examples like that, then assess via a null hypothesis test.
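Concretely, the null hypothesis test could be as simple as a one-sided binomial test against chance (the trial counts below are made up):

```python
from scipy import stats

n_trials = 200   # hypothetical number of A/B patch comparisons
n_correct = 121  # hypothetical number the model gets right

# One-sided binomial test against chance (p = 0.5).
result = stats.binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.2f}, p-value = {result.pvalue:.4f}")
```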
Agreed. I asked 4o this and it responded with the list below; I find 7 especially interesting:
1. Geolocation from text
LLMs can infer locations based on linguistic cues, even when not explicitly stated. For instance, mentioning “subway” vs “tube” can hint at NYC vs London.
2. Code translation and generation
Early expectations were that LLMs would help autocomplete code. In practice, they can now refactor, translate between languages (e.g. Python to C++), write unit tests, and even debug.
3. SQL generation and schema understanding
Feeding an LLM table schema and a natural language question can yield accurate SQL queries — without explicit programming. This was unexpected given the precision SQL usually demands.
4. Mental model inference
LLMs can simulate how a child or novice might think, which is useful in teaching, UX design, and safety testing. This required no separate training — it emerged from general pretraining.
5. Style and persona mimicry
They can convincingly mimic the writing style of historical figures, fictional characters, or even users, based on small text samples — far beyond template-driven responses.
6. Image and layout reasoning (in multimodal LLMs)
For example, they can interpret web page layouts or identify accessibility issues in screenshots, even without fine-tuning on specific UI datasets.
7. Theory of Mind-like tasks
LLMs can simulate what one character knows that another doesn’t, allowing for decent performance on tasks that involve deception, surprise, or belief tracking.
8. Emergent arithmetic and logic
While not 100% reliable, LLMs can handle a surprising range of arithmetic and logical reasoning problems, especially with chain-of-thought prompting — despite not being trained explicitly for maths.
9. Error correction and fuzzy matching
Given a corrupted list or misspelled inputs, LLMs often restore the correct form with high accuracy, mimicking fuzzy logic systems.
10. Working as APIs over unstructured data
LLMs can act as ad hoc interfaces for messy PDFs, emails, logs, or transcripts — parsing and extracting meaning as if they had structured access.
I didn't predict it but I think it's more of a "never thought about it" thing. If, without knowing the result, someone would have asked me to guess I would have said it's probably pretty good at geoguessr. But it's hard to be sure, and now we can't test. But I have some other predictions in the comments, which would be interesting to check if someone feels like it (https://www.astralcodexten.com/p/testing-ais-geoguessr-genius/comment/113979857)
- getting the chemical compound and CAS number from a picture depicting the structure even if it's a sketch or blackboard photo from chemistry class (confidence 8/10)
- identifying the year an obscure artwork was created and the artist from a photo (7/10)
- guess which country a person was born in from a photo (6/10)
- identify where an audio recording was taken (5/10)
- tell you the approximate nutritional content of a dish from a photo (8/10)
- determine when and where someone grew up based on a handwriting sample (6/10)
> - tell you the approximate nutritional content of a dish from a photo (8/10)
I expect it would have trouble with portion sizing on this task, even (or perhaps especially) if you added a deck of cards or some other object of known size (not a banana because a banana is food) for scale.
If it _doesn't_ have trouble with that task, new food tracking app idea just dropped (or rather, idea that's been around forever but historically has never worked).
I suppose we should agree on what counts as "doing well". Similar to the GeoGuessr task I would say if it's better than any non-nutritionist expert that would count?
You can probably get a lot of mileage from learning typical serving sizes, how large plates are, etc.; that would work at least for restaurants or if you follow a recipe closely. I would agree that it's probably harder if you cook yourself and there's not much of a reference.
Overall, it's also a little tricky to verify this particular task because I don't expect recipe websites to be super careful about the nutritional information, but as a spot check, I tested with two recipes
Result (ChatGPT doesn't allow sharing chats with images, apparently; "net carbs" is carbs minus fiber, which it wants to break out for some reason):
To be clear I do expect it'll get the ratios correct, the specific thing I expect it to be bad at is estimating how much food is on the plate. MyFitnessPal already has the option of saying "I ate 300g of a chicken breast and veggie bowl" - the issue is that people are bad at estimating how much 300g is, and getting out + taring the scale adds a lot of friction.
Knowing a lot of stuff is an advantage. I know a lot of stuff, more than most ACX commenters (who tend to be better at logic than me), but I sure don't know the extraordinary amount about boring stuff, like what the Staked Plains look like, that AI systems know.
Yeah, after getting the Galway one right, I'm updating towards maybe if I'd lived in the Staked Plains for a few years I would be able to recognize it and distinguish it from other flat featureless plains on sight (I did drive through the area once, so brief exposure isn't enough). And maybe o3's training is the equivalent of a human living everywhere in the world for a few years.
I'd do pretty good at identifying photos of the world's 100 most popular golf holes because I've looked at 10,000+ pictures of golf holes, but an AI system that has looked at 1,000,000 pictures of golf holes would beat me at identifying the 900th to 1000th most popular golf holes. It just knows vastly more than I do.
Holes are the subsections of the course that start at the tee area and terminate on the literal hole, with a full-sized course consisting of 18 holes and a small one consisting of 9. So he's saying he could, for example, reliably identify a picture as being taken on the fairway of the 7th hole of Pebble Beach Golf Course.
Having played video game golf (on PC) quite a bit (20 years ago) still occasionally causes me, upon seeing a momentary glimpse of TV golf, to exclaim, "Oh, this is the 4th hole at XYZ, it is a dogleg with a sand hazard just left of the green."
Because I find the best golf courses beautiful. Perhaps 5% of the male population takes a connoisseur's interest in golf courses and can identify, say, the top ten golf courses and the top 5 golf course architects from a photo or two.
What's weird is that it never ever occurs to nongolfers that these huge projects might have some artistic interest for some people, whereas few find it unimaginable that some people take an interest in building architecture.
5% seems high for that level of interest. Although I learned that golf seems to be much more common in the US than in Europe (12%-14% according to o3), this would still imply that 40% of golfers become that level of connoisseur. Or did you mean 5% of male golfers?
Funnily enough, and I should have expected this, when asking LLMs "which percentage of men play golf?" they tend to answer something about gender ratios (Gemini 2.5 pro, Perplexity, GPT-4o, Claude 3.7 even with thinking), although o3, o4-mini and r1 got what I wanted.
According to the National Golf Foundation, which appears to be a trade association for golf-related businesses, 28.1M Americans played at least one round of on-course golf in 2024, or about 8.2% of the total population. On-course golfers are 28% women (7.9M). Which implies that probably a little under 72% of on-course golfers (20.2M) are men. That works out to 4.6% of women and 11.9% of men played at least one round of golf last year.
They have much higher headline numbers, but those are based on either anyone who showed even passing interest in golf as a spectator sport (138M, a bit over 40% of the population) in addition to players, people who participated at least once in "off-course golf" (driving ranges, golf simulators, etc).
The USGA reports that 3.4 million golfers had a handicap index in 2024 (about 1% of the population), which seems like a reasonable proxy for relatively serious golfers. They don't report a breakdown by gender, though.
Assuming the ~3:1 gender ratio holds up throughout, 5% of men would be a little under 10% of men who play golf or have at least a passing interest in watching or reading about profession golf, about 40% of men who played at least one round of golf last year, or about 300% of men who play golf seriously.
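A quick sanity check of that arithmetic (population figures are rough approximations; the NGF/USGA counts are as quoted above):

```python
us_pop = 341e6
men = us_pop / 2              # crude 50/50 split
on_course = 28.1e6            # NGF: played at least one round in 2024
on_course_men = on_course * 0.72
serious_men = 3.4e6 * 0.75    # USGA handicap holders, assumed ~3:1 gender ratio

five_pct_of_men = 0.05 * men
print(f"on-course golfers / population: {on_course / us_pop:.1%}")               # ~8.2%
print(f"male on-course golfers / men:   {on_course_men / men:.1%}")              # ~11.9%
print(f"5% of men / male on-course:     {five_pct_of_men / on_course_men:.0%}")  # ~42%
print(f"5% of men / serious male golfers: {five_pct_of_men / serious_men:.0%}")  # ~330%
```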
I don't think being able to recognize the course from photos of the 7th at Pebble Beach, the 12th at Augusta National, the 18th at St. Andrews, the 16th at Cypress Point, the 16th at TPC Sawgrass, and so forth is all that challenging for many golf fans. Naming 5 famous designers from pictures of their famous holes is harder but not all that rare. After all, there have been several hundred books published about golf course architecture.
What's more strange is how the existence of this artistic subfield comes as a surprise to so many nongolfers.
My only exposure to Ireland was going around the Ring of Kerry ~8-9 years ago, but after seeing that photo I *felt* a few memories being jogged and crashing into my consciousness (leading to a mix of the two): one about a lighthouse I wanted to visit (and dragged my friends to, since I was driving), which of course became a complete adventure driving down ever lower-quality but steeper roads (worth it, though); and the other, which matched the photo a lot more, somewhere on the coast where we stopped to take photos. (But that spot was a good 50m more elevated.)
... and I was thinking maybe it's Scotland, or... and then spotted Galway in the text below the image. Woah.
A while back I was watching someone play geoguesser and a highway came up. Immediately I thought it looked like something in my state: when I tried to justify the thought, the reasons I came up with were all kind of weak. The trees look local, but then again there are trees like that over multiple states (and countries). The weather was overcast and it’s often overcast here, but millions of places around the world have overcast skies from time to time. The road was obviously American, but America is a big place. I started out 90+% confident it was in my area, but after trying to justify that confidence I was down to like 60%.
Naturally it turned out to be a place about 50 miles away from where I was sitting.
I wonder if people, when asked to explain how they know something, are any more accurate about their internal process than an AI is?
We mostly use the same mechanism: guess an answer based on vibes (aka subconscious computation) and come up with rationalizations afterwards. Possibly use the rationalizations as a sanity check, although that is already going the extra mile.
Generally speaking, my model for this is that we provide a reinforcement learning environment for our children and each other that will train us to come up with plausibly socially defensible justifications for our actions until we internalize the process, which conjures up this sort of (often self-serving) "hallucination". I suspect similar mechanisms for models trained to output chain of thought with RL.
But similar to LLMs, while our rationalizations aren't really faithful, they are still fairly strongly correlated with and probably close to a best guess about what we're thinking. In some cases it's possible to improve on your best guess if you know about certain cognitive biases and carefully observe your own foibles, although I find this harder to internalize (e.g. reasons why it's ok not to exercise today, why eating this cookie now is a really good idea, or why that annoying jerk you're married to had it coming).
I'm reminded of a line in the film The Dig, from a scene in which the main character, an archeological digger, is having doubts about his role in a project. His wife, working to restore his confidence, asks him why he does what he does:
"I do it because I'm good at it. Because that's what my father taught me, and his father taught him. Because you can show me a handful of soil from anywhere in Suffolk and I can pretty much tell you whose land it's from."
Having the equivalent of that much experience for every point on the globe is something to be reckoned with, even if it is just a pattern-matching machine.
I picked seeds (for a seed company) in southern California for a few years, in college. After a few days of picking a particular plant, you could pick it out in your peripheral vision years later at 100yds-200yds while driving at freeway speeds mostly based on shade of green. I don't know if it was truly absolute shade of green (which might be incorrect in an image) or a combination of shade of green and foliage density/texture resulting in marbling types of effect.
Seems to me that talking about superhuman AI skills is a red herring. What you describe here is a human skill, just an incredibly rare one. I guess most people would agree that it should be possible for the AI to acquire this skill, if it had enough relevant data, and enough computing power to consider the data from various possible perspectives.
The difference is that for you, obtaining this skill required some unusual circumstances. (Maybe also good eyesight and observation skills.) Different circumstances could have led to different skills. But for the AI, if it has a sufficiently large dataset and enough computing power to process it, then if it can get one such skill, it can just as well get millions of them. In different fields, on different levels of abstraction.
An AI that could have all these very rare but still human skills -- or maybe just 10% of them -- would be for all practical purposes superhuman. The skills alone would already be amazing, but their combinations would be like magic to us.
I think the skill I acquired was universal. At least, my seed-picking partner and I both experienced the same thing. My point was just that if you bury your head in one thing (a bush in this case) enough, then something else (a different kind of bush) that might have previously seemed identical suddenly is not. This seemed applicable to LLMs because this seems like what they do. That is, look at thousands of labeled items and try to find characteristics which are common amongst those that hold a specific label. And that this may seem magical only because you didn't have the patience to do it.
I'd imagine that AIs can have a big advantage over most people at GeoGuessr in that they can be ordered to pay as much attention to pictures of boring places as of interesting places. Sure, people, being typically picky about what they find interesting, take more photos of, say, an old bridge over the Seine River with the Eiffel Tower in the background than of the Staked Plains, but an AI can stare just as hard at each photo of the latter as at the former.
Presumably, people who are good at GeoGuessr have a high tolerance for boring pictures too.
Yeah - this strikes me as exactly the type of thing I would expect a modern LLM to do really well. I regularly play with various models trying to see how they can get along with algebraic number theory, typically results that I have proved or am working on but which do not otherwise exist in the literature, and the outcome is pretty variable, sometimes impressive (but never completely correct so far in my experience), useful for bouncing ideas around & surveying relevant known data and theory, but often disastrously bad.
A relevant video I found interesting: Rainbolt (a GeoGuessr YouTuber) competes with o3 on an OSINT geolocation question and comments on the methods it uses.
> "Okay, so it can’t figure out the exact location of indoor scenes. That’s a small mercy."
https://arxiv.org/abs/2404.10618 finds models pretty much can figure out where and who you are with indoor images, and they haven't even tested the newer models.
I only skimmed the paper, but it seems unimpressive. Of the three image examples they give:
- It figured out someone was in Wisconsin because they had a Wisconsin sports team poster on their wall.
- It figured out someone was in Colorado because they had a Colorado tax form posted on their fridge.
- Denied of clues like this, it just said someone lived in the USA, based on them owning US brands of appliances.
In retrospect, maybe I made it too hard by giving it a dorm room, which is naturally going to be pretty cookie-cutter. But Kelsey said it wasn't able to figure out her location from the inside of her house (though it could from the outside).
To be honest I fully expected it to guess the dorm room correctly based on the specific types of furnishings in the room. I imagine lots of college students send images of their dorm rooms around, and say things like 'look at my new dorm room at University of X!' I was quite surprised it failed this test.
This might be not directly on topic, but: given the AI 2027 scenario, do you feel comfortable paying money to AI companies? Do you think it is just too insignificant a contribution, or perhaps genuinely neutral/positive?
I think it's useful for me as a person who writes about AI to know how to use them. I previously tried to only pay for Anthropic, which I think is the safest/most ethical company. But my wife bought an OpenAI subscription and I piggyback off of hers.
That makes sense. Perhaps you also have some ideas on the broader question of what ordinary people should or shouldn't do with regard to AI, if they can't be safety specialists or policymakers or multipliers? Donating to AI safety causes, I assume - anything else?
(I wanted to ask the question in the AMA, but missed out due to the time difference...)
These companies burn billions a year, and especially in the AI 2027 scenario they won't even externally deploy their models anymore, so your 20 dollars or whatever you pay for the API don't matter. If you would use the AI fairly often, but don't pay due to ethical concerns, you're missing out on a lot of value while making a negligible difference to outcomes. If we could coordinate to all stop using AI, this would be a different discussion, but that seems even more difficult than pausing AI development.
It's a bit like buying stuff from Amazon, driving a car, or giving out personal information. I'd rather avoid it and do it as little as I can, but sometimes it's impractical to forego it and doing literal zero is probably a mistake.
So I really wonder if this is as impressive as it looks, because a well-trained human can do the same thing. Getting strong mentat vs. abominable intelligence vibes here. A mentat does the same with training, but the AI can do it 'faster' since it's got a factory-sized computing farm hooked up to its processes.
For reference, look at this. It's insane how good human geoguessr players are and they certainly know how to recognize flat, featureless plains.
I was able to find this guy explaining his tricks - see https://www.youtube.com/watch?v=0p5Eb4OSZCs . Most of them have to do with the road itself - the shape of the dashed lines, the poles by the road, the license plates on the cars, sometimes even which Google car covered a certain stretch. I don't know how he would do on pictures like these where it's not Google and there is no road.
It's really impressive, but it's not as featureless as the plains photo, and it seems that the compass and north sun direction gave a big hint that it's in Brazil (though I don't understand why Brazil specifically and not South America). o3 doesn't get a compass. If you chose two random points in Brazil, you'd probably do about as well. One guess was 1000 km off, and the other 2400 km off.
I'm a reasonably good but definitely-nowhere-near-pro Geoguessr player (Master level before I decided I preferred other formats to duels), and in terms of the meta like car, camera generation, season when coverage was taken, and copyright, how important they are depends a lot on the round. There are certainly 'famous' roads that I will recognise without any of that, and then you can line up the angle and get what looks like an amazing guess to somebody who hasn't played much, whereas it's just a routine guess for those of us who have played a great deal.
It really depends so much on the round what you use - road markings and bollards definitely do help and limit options, as do lots of other things. Watching movies purported to be set in particular places is amusing if you are a Geoguessr player. Being able to vibe rounds is also a big thing too. And for me that mixture is part of the fun of it. There's one World League player who I believe doesn't use car meta. I think camera generation and season/weather can become part of your unconscious vibe for a place, though, even if you don't intend it to. Almaty was covered in winter and always looks so bleak, for example. There was a recent pro charity tournament on 'A Skewed World', where the camera faces away from the road, and it was interesting to see how they did on that, although you can't get rid of camera gen/copyright.
If you are looking at a really flat, empty landscape, you are probably going to be in the North American Great Plains, Kazakhstan, Mongolia, or Patagonia. I think with that first photo the Great Plains are the only option with that grass - it just doesn't look like the other locations. It's always hard to say where exactly you would genuinely guess when you already know the location, but the lack of greenness would probably make me go south. I feel fairly sure that better players than me would know the rough area as well, although it's hard to know what are easy/hard rounds.
For the record, Rainbolt (the linked GeoGuessr player) is probably in the top 250 worldwide, but there is certainly a perceptible gap between him and the very best players. To me, this GeoGuessr performance looks like the very top echelon of human players.
The difference is that rather than having to spend 10 years training a mentat (or in this case an intelligence analyst) you can just copy the AI and do it at scale. So even when it's not doing something that a human specialist couldn't, the fact that it can be done by a random person rather than with the resources of a nation state is a big change
But it *is* done by a person with the resources of a nation state, because that's how much energy LLMs use to do their thing. Sure, you only access a fraction of that energy, but much in the same way you'd access only a fraction of the academic system if you had a bunch of mentats (i.e. researchers) do whatever it is you want your AI to do. And it remains to be seen if AI can actually achieve consciousness, even in the 'predictive processing' sense of the term. I'm not sure I see LLMs going there.
That's an interesting perspective, because I would view "amateur doing this for fun" (i.e. actual GeoGuessers) as the easier to copy and do at scale. It costs very little, as it's apparently entertainment for the person and can be done by a single person in their spare time. AI, on the other hand, does take the resources of a nation state. The AI industry is already much bigger and more expensive than many real life nations. Wikipedia lists 191 countries. Market sizes for AI right now is larger than 139 of those, including the entire economies of Hungary or Ukraine. Obviously GeoGuessing itself is only a fraction of that, but you can't get GeoGuessing at all on a small scale, because you can't get AI at all on a small scale.
I think I'm crap at these things, so I still don't know how I got an immediate reaction of 'oh, definitely Nepal' from myself for your second picture. I think your imaginary flag just looks very Nepali. Maybe that plus 0 Andean or Alpine vibes. (Never hiked in Nepal, did some trivial hiking in Peru and Switzerland.)
That college dorm room? Got West Coast vibes. Definitely US or Canada - and none of the places my brother and I went to school looked quite like that. It doesn't look like any place on the East Coast (can't tell exactly why), and nothing in it makes me think of the Great Lakes region. Even if one thinks of America as a place of identical furniture and near-identical buildings - there must have been enough variation 20 years ago that even someone who neither cares nor knows anything about those things can get an intuitive feeling.
More to the point: using tremendous amounts of data + being finally able to process images + basic reasoning skills (as long as the chains of reasoning are short and nobody is expecting perfection) -> where we are at in terms of AI.
The Nepal rocks and dorm room were the most obvious ones. Those kinds of rock formations are, to someone versed in geology, very distinctive. And the time period for the dorm pic can be felt more than seen to me, probably because I've been shopping at Wal-Mart for decades and just got tweaked on those pillowcases and that lamp.
Of course, AI will compare photos from different time periods and see the same things in the clutter. And of course, AI can compare rock formations in detail. Nothing about Geoguessr is helicopters-to-chimps level of amazing for me, it's exactly what I would expect AI to be good at.
Can you explain what's distinctive about those rocks? Are they only distinctive enough to point to Nepal, or to the specific location north of Gorak Shep?
I actually thought the AI explained itself fairly well here: fresh, light-grey leucogranite blocks are not found everywhere, especially at that altitude where there's no surrounding vegetation. How many other places like that are around Nepal?
Rocks are distinctive. I feel like AI would be absurdly good at geological formations generally, like if it saw one of my pictures of Lake Assal in Djibouti, or the precambrian rock walls in the Wind River Valley.
Leucogranite blocks of that size/type of scree are found along a wide swath of the Tibetan Plateau at a certain altitude range.
I'd bet my house on the fact that few geologists (the exceptions being those who study this region specifically, or geologists who happen to be mountaineers) would be able to pinpoint it as accurately (Gorak Shep) as o3 did here.
And *of course* it would identify your pictures from Lake Assal or the Wind River Valley because there are literal hundreds of thousands of them on the internet.
A zoomed in photo of rocks (found along an entire subcontinental-meets-continental plate!) and it points out almost exactly where? That's something else completely.
But it isn't a zoomed in photo of rocks. It is a photo of a fantasy flag planted between those rocks with a trodden path just behind it. It guessed "Nepal, just north-east of Gorak Shep, ±8 km”. Do you know what is almost exactly north-east of Gorak Shep, ~3.3km as the crow flies? Mount Everest Base Camp. It is making a very educated guess based on where the kind of person who is taking a picture of a fantasy flag somewhere in the Tibetan Plateau would most likely have done so.
If someone asked me "where in the Tibetan Plateau might someone plant a flag and take a picture of it", literally the first (and perhaps the only) thing that would come to mind is "Dunno, Mount Everest?" And that would already be almost as good as o3's guess here. I mean, the slopes of Mount Everest have got to be just about the least random place to take a picture like this.
The hard part is figuring out the type of rocks, the altitude, etc. But if one is allowed to use tools (except LLMs), this should be quite doable. And then there is the lizard brain thing that makes several people in this thread report that they immediately guessed that the picture was taken in Nepal (I can't speak for myself, I did not try the challenge).
I don't know about any geologists or mountaineers or mountaineer geologists but I am confident that a competent forecaster who would put at most a day or so into this would come up with basically the same best guess. It is not a high confidence guess but about the best you can do. And here the AI got lucky.
Don't get me wrong, this is very impressive. But it does not make me feel like a chimp. It feels like something I (not very experienced in geoguessing but a reasonably good forecaster) could have done myself with a lot of elbow grease.
What makes me more confident about this is that I have now tested o3 myself with a few images (with the same prompt). It failed at some very easy ones. E.g. the skyline of my >100,000-inhabitant European hometown, somewhat pixelated - it got the wrong city. On other easy examples it lacks precision. E.g. a picture of a busy road in Mombasa with lots of clues to go by. It gets the street right but is unnecessarily off by 500m by completely failing to take into account a petrol station that is right in view. As with everything with LLMs in my experience, the performance is very hit and miss.
If you are willing to bet your house on a similar challenge, we might be able to come to an agreement.
You make a fair point that, among that wide area, the route to EBC would be the most likely.
Having been there, I can show (and provide photo evidence) that the geology doesn't differ much between Gorak Shep, EBC, and C1, and further still until glaciation.
That was a chimp helicopter moment for me. I’ve been where Scott was, I’ve been beyond it, it all looks the same (and indeed is the same type of rock), and o3 nailed it.
At the same time, you’re right to point out that it can’t nail much more easily identifiable clues.
It’s so hit and miss that it’s either eerily competent or the opposite. Given that, I’ll probably reevaluate the willingness to stake my home.
You have humans that are experts who can tell you "yes this piece of rock is a worked tool from the Stone Age and not just a piece of rock", so I wouldn't be astounded that AI can check the geology of a piece of rock and work out "this type of rock is found in these areas of the world".
Only Nepal (at that elevation band). My guess for that photo was ~300km distant, but only because I was picking between by far the two most popular treks in Nepal and I picked the wrong one (Annapurna Circuit, rather than Everest Base Camp).
My strong suspicion is that you could have taken a very similar photo in areas very far distant from Gorak Shep, and ChatGPT would still be biased towards the Gorak Shep area because of its relative popularity (and outsized representation in the training set).
The AI's explanation of the trick says it has access to "photo galleries of the Gorak Shep-EBC trail showing identical rubble field".
The geology by itself isn't distinctive enough to point to that specific location. However, it sounds like the AI had photos taken by other tourists of that exact specific patch of rubble which it could immediately call to mind.
Are you sure it's not just that the majority of film footage of climbing that you have seen, is in Nepal? It certainly looks like Nepal to me (flag, broken rock) but probably, more than 70% of the climbing stuff I have seen (I've seen quite a bit) is in Nepal.
I failed the Nepal picture pretty severely. My first guess was one of the disused quarries famously used by British sci fi shows (most iconically in Doctor Who) as a location for shooting scenes set on barren alien landscapes. But looking those up, they seem to generally have spikier or blockier rocks than Nepal pic, where the rubble is presumably older and more weathered as well as being naturally deposited. Also, a different type of rocks (leucogranite has been mentioned by Crayton Caswell) that likely has very different cleavage patterns from the limestone and slate of most of the BBC Quarries.
I was interested in professional Geoguessr for some time. It's important to note that what top players do is quite unbelievable from the perspective of untrained people. But even then, here o3 seems to be quite a bit better than the top players.
Disagree, it seems extremely comparable but not leagues above the top players. Still very frightening but it has very different ramifications than if it were performing well ahead of the best human players (Blinky, Consus, MK, even zi8gzag).
My answers to the first two were "either Texas or Central Africa" and "Tibet". I feel like I earn at least half the score for this, but I just guessed the most stereotypical dry grassland and most stereotypical mountains.
Same logic for the river made me guess India, though, so quite a bit farther off.
As an avid geoguessr, I was interested to try this. I used Kelsey Piper's exact prompt and provided a picture I took of a flooded park in the Appalachian region of Virginia. It's a relatively generous zoomed out picture. Only 542 X 729 pixels, but taken with a high quality camera and not at all blurry. It included numerous trees, hills, a bridge in the background, a pedestrian path, and a building in the distance. I have a ChatGPT subscription and used 4o.
Chat was nowhere close. It guessed Baden-Württemberg, Germany followed by a string of other European countries. Once I told it the photo was in North America, it zeroed in on upstate New York or southern Quebec.
The image wouldn't be easy for an experienced Geoguessr, but they should at least get the right region of the U.S. Not sure why Chat was so bad with my image but I will try a few more.
Ah, of course. Tbh, I'm not super familiar with the differences and relative power between the models. Just tried a new image with o3... a newly built parking lot with trees, hills, and buildings in the background, including an obviously visible Chick-fil-A. It's much closer, but still off by a state or two - its top guesses are all adjacent states.
I'm continuing to play with this (using o3 now - see comments below). It seems accurate that o3's overall geoguessing abilities are comparable to the top human geoguessrs, e.g. Sam Patterson. Rainbolt has some videos that seem to suggest o3 is about on par with him or better now also.
Again I'm not an expert in anything AI-related, but what this experiment is illustrating for me are the ways that AI and human intelligence are still somewhat asymmetrical. The beach photo or the random rocks in Nepal are the types of guesses that are truly superhuman. On the other hand, there are types of images (like the ones I've been uploading) that it apparently cannot guess as well as a human - hence why several people were still able to beat it. AI is becoming more and more powerful, but not along the same pathways as our brains.
The big hint in the Galway picture is the yellow painted lines on the side of the road. That would exclude anywhere else (except the Republic of Ireland) with that likely scenery as far as I know. I would have said Galway or Connemara immediately.
I am fun at parties. Also, I do play GeoGuessr.
Is the very long prompt added every time, or put into the settings? Or do you say "remember" and then give that prompt?
I would be interested in a proper comparison with and without Kelsey's prompt. How much does such prompt engineering matter? I guess a lot? It seems useful to know whether it's a lot or a little.
Would also be interesting to compare to humans following the prompt, rather than working blind. It strikes me that to actually follow the prompt as instructed myself would take a significant amount of time. But the structure of the prompt probably helps a lot. When I looked at the pictures I was just going by immediate intuition
For me the most amazing thing was that it understood the prompt sufficiently to follow it. That requires understanding and is much more than merely predicting the next word in my book.
I build automations for work that use LLMs to extract data from documents like invoices, which seems like a somewhat similar task for the purpose of the question. Often the AI does a pretty decent job with minimal instructions, but the prompts do still matter a lot and improve performance in certain cases. For a rough magnitude of the effect, think maybe 85% accuracy vs 95% (accuracy here being defined as correctly extracted values over all values).
I also often end up with such long prompts. They come about because you try a bunch of examples, find that the AI gets some of them wrong and tinker with the prompt until it gets them right. Rinse and repeat.
This does run a risk of "overfitting" the prompt to your examples, however, and rigorously evaluating whether it's actually an improvement in general or perhaps reduces performance on other examples is a pretty beefy data science task in itself, which I don't usually do because customers don't seem willing to pay for it.
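For concreteness, a minimal sketch of the kind of evaluation I mean (field names and values are hypothetical):

```python
def field_accuracy(predictions: list[dict], ground_truth: list[dict]) -> float:
    """Correctly extracted values over all expected values."""
    correct = total = 0
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total += 1
            if pred.get(field) == expected:
                correct += 1
    return correct / total

# Two toy invoices, three fields each; one wrong currency -> 5/6, ~83%.
truth = [{"invoice_no": "A-17", "total": "120.00", "currency": "EUR"},
         {"invoice_no": "B-02", "total": "85.50", "currency": "USD"}]
preds = [{"invoice_no": "A-17", "total": "120.00", "currency": "EUR"},
         {"invoice_no": "B-02", "total": "85.50", "currency": "GBP"}]
print(f"accuracy = {field_accuracy(preds, truth):.0%}")
```

Running each prompt variant over a held-out set like this, rather than over the examples you tuned on, is what guards against the overfitting I mentioned.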
This is clearly an example of doing something that we can imagine even if we can't do it ourselves. Since at least the first Sherlock Holmes stories, we've been imagining someone showing intelligence by knowing a vast range of incredibly niche information — just like it does with the geology of Nepal. It's impressive but not inconceivable or hitherto thought impossible.
"Observation tells me that you have a little reddish mould adhering to your instep. Just opposite the Wigmore Street Office they have taken up the pavement and thrown up some earth, which lies in such a way that it is difficult to avoid treading in it in entering. The earth is of this peculiar reddish tint which is found, as far as I know, nowhere else in the neighbourhood."
“Tells at a glance different soils from each other. After walks has shown me splashes upon his trousers, and told me by their colour and consistence in what part of London he had received them.”
Yes, and to be precise, we can’t do it ourselves because of a lack of knowledge, not intelligence. Similarly, we can’t build spaceships because of a lack of economic resources, infrastructure, and incentives, not intelligence.
I agree this is important. It's useful to have an AI do it because we can theoretically teach it a vast array of information that would be difficult or impossible to get a human to learn or remember.
It is not, on the other hand, acting at a superhuman level when doing so.
This is not clearly an example, because that requires believing that inner-monologues are faithful and that o3 is not using any non-robust features which we know to exist and be invisible to humans. The former is extremely well known to be false at this point, and the latter is almost certainly false too because there is little reason for it to be true and every other computer vision system exploits non-robust features ruthlessly.
What sort of features would it be picking out that would be more useful for guessing locations? I'm assuming Scott successfully stripped the metadata.
Unfortunately, the thing about non-robust features and other cues that NNs pick up on but humans can't see, is that we can't see them, so it's hard to say. Even when you use salience maps or try to do attribution to pixels, the pixels just seem arbitrary, a sprinkle at random. It's something about... the texture being slightly different? The green being slightly not green? It's fairly rare to be able to reverse-engineer a clear interpretable feature with a fully known causal story, like "it's looking for oval stones, not round stones, and this works because olivine stones are tougher and don't round off in geological time which pinpoints this part of the Whatever Mountain Range rather than the similar-seeming Andes Mountain Range". And you get into philosophy here quickly: maybe we *can't* see them and that is the price we pay for some other property, like robustness to adversarial attacks, and we can no more see them than we can see ultraviolet or hear bat rangefinding squeaks, and at best, we can study them like Mary in her room. (While you might hope that as these systems scale, they'll have to learn in a human-like way and start ignoring those non-robust features and see like humans do, there's evidence that there's a general inverted U-curve of competency: they become more human-like as they get better and approach human level... but then they keep going and start thinking in ever-less-human ways.)
I get what you are saying, and I'm perfectly willing to agree that AI is able to infer patterns completely invisible to us in such a way that we would not be able to guess they might be a possibility (this has been a regular occurrence anyway even before gpt), but I wouldn't be willing to agree that we don't know the bounds of physical possibility, as alluded to in Scott's second paragraph. Certainly this demonstration doesn't transcend known reality, even humans can do this kind of thing with a few months of training. So I'm not sure really what the implication is meant to be.
Even humans can’t describe precisely what they’re doing when they recognize someone. Imagine asking someone to describe a person, and then you have to pick that person out of a thousand. Compare that to trying to recognize someone whose face you’ve already seen.
The data we can perceive about the subtleties of other people’s faces is much denser than what we can convey in words. Same goes for rocks.
Until we have clear evidence of them achieving unimaginable results, we will assume that they aren’t doing unimaginable things. Our imagination is indeed quite powerful.
I remember seeing an article about a pre-LLM machine learning program that, given a photo, was supposedly able to guess how many people were in a room adjacent to the one pictured by interpreting the shadows on walls and similar lighting effects. If it was actually true, this would rank as "unbelievable", maybe, but still not "unimaginable". It still complies with our normal ideas about what is physically possible or not, about information theory, etc. I can't imagine applications of superintelligence not falling into this same pattern.
This does not exclude the scenario where the AI invents nanomachines, even if we thought that they would be physically impossible, and all we could say is huh, I guess it’s totally possible after all. I don’t think thinking about physical limits is very useful, because we don’t know where they are. I’m personally now wondering how much these futuristic scenarios are really bottlenecked by intelligence or by more mundane things like having to wait for industrial capacity to slowly scale up without diverging too hard from the needs of human consumers, which are the source of its funding.
While that is true, I think the reasoning part of o3 would struggle to take advantage of those non-robust features. Like, if the NN picks up on subtle cues that a photo is from Nepal, it can't easily express why it thinks it's Nepal for further internal reasoning. It's something the NN would just know during inference of a token. When it can express the reasoning clearly, it can build on it across many inference steps.
It could still internally jot down something like "I'm getting Nepal vibes from this" and incorporate that into its reasoning.
I notice it uses "dies" to mean "is ruled out". If it is ever making a choice between people, let's hope it never confuses a figurative meaning of dying with the literal one!
Pasting that particular picture into o3 with no context and no system instructions also sometimes causes it to talk about the Nullarbor Plain. Perhaps some images of Texas were in its training set as "Nullarbor Plain"?
I was very confused by this, too. There are areas of Australia that look much closer to flat grassland than the Nullarbor does to a human eye, although I suppose they're mostly used for grain crops and might have been eliminated earlier.
Information theory would predict that this is exactly the kind of thing that machine learning should be good at. Most of us chimps don't know information theory, but chimpanity as a whole does. Two points: the surface area of the earth is about 5.1x10^8 sq km, and you find 10 km impressive. At roughly 100 sq km per cell, that's about 5.1x10^6 locations. Information theory tells us that's only about 22 bits of entropy.
Another thing we know from information theory is that information leaks. We know this because we constantly trip over cases where we intended to prevent an output from including some information, only to find that it did anyway.
We also know that people find something to appear magical when they don't expect, or can't visualise, the amount of effort that went into doing it. In this case, o3's training has clearly ingested most of the GeoGuessr websites. It has constructed filters each yielding a small amount of information about location (<0.1 bit) in quantities that are impractical for a human GeoGuessr player. Which is impressive, but not something that implies it will magically apply to any problem involving significant information output, which most problems do.
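A quick back-of-the-envelope check of those two numbers (assuming Earth's surface area of ~5.1x10^8 sq km and 10 km x 10 km cells):

```python
import math

EARTH_SURFACE_KM2 = 5.1e8   # total surface area of the Earth
CELL_KM2 = 10 * 10          # one 10 km x 10 km cell

cells = EARTH_SURFACE_KM2 / CELL_KM2           # ~5.1 million cells
bits = math.log2(cells)                        # ~22.3 bits of location entropy
print(f"{cells:.1e} cells -> {bits:.1f} bits")

# At <0.1 bit per weak cue, pinning down a cell takes a few hundred cues --
# impractical for a human to apply by hand, routine for a model.
print(f"cues needed at 0.1 bit each: {math.ceil(bits / 0.1)}")  # ~223
```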
It's not surprising that it can't effectively introspect about how it's doing it.
In addition, you can reduce this by a lot once you consider that pictures are not taken randomly anywhere on earth, but only in the very specific locations where humans go. This makes the task much easier. E.g., you guess a popular tourist destination for the brown water, not some unreachable, God-forsaken place.
I think AIs are already decent at writing their own prompts (I know, I know, perpetual motion, but it seems to work!) and if it ever became truly economically important you could automate it (get AIs to try lots of prompts, see which work best on an easily gradable task like GeoGuessr, then train towards strategies that create good prompts). I don't think this will be a marketable skill for more than another year or two - although it's certainly not unique in that.
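A minimal sketch of that loop, assuming hypothetical `ask_model` and `mutate_prompt` helpers and a hand-labeled photo set (real systems would also need a held-out set so the prompt doesn't overfit the grading data):

```python
import math

def ask_model(prompt: str, photo: bytes) -> tuple[float, float]:
    """Hypothetical: returns the model's (lat, lon) guess for the photo."""
    raise NotImplementedError

def mutate_prompt(prompt: str) -> str:
    """Hypothetical: asks an LLM to propose a variation of the current prompt."""
    raise NotImplementedError

def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def mean_error_km(prompt: str, dataset) -> float:
    """Average guess-to-truth distance over labeled (photo, (lat, lon)) pairs."""
    errors = [haversine_km(ask_model(prompt, photo), truth)
              for photo, truth in dataset]
    return sum(errors) / len(errors)

def search(seed_prompt: str, dataset, rounds: int = 50) -> str:
    """Hill-climb: keep any prompt variation that lowers mean error."""
    best, best_err = seed_prompt, mean_error_km(seed_prompt, dataset)
    for _ in range(rounds):
        candidate = mutate_prompt(best)
        err = mean_error_km(candidate, dataset)
        if err < best_err:
            best, best_err = candidate, err
    return best
```

The same skeleton works with an evolutionary population instead of a single hill-climber; the only essential ingredient is the easily gradable score.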
For easily gradable tasks, sure. But more ambiguous tasks, or creative tasks, that's where those who are great with prompts will have opportunities. AI art is slop right now, but I could easily see it being put to very efficient use to create media that is actually worth something when people get the prompts right.
And that's appropriate: how often do we need answers less than we need to know the right questions?
I think there will be transfer learning from the easily gradeable tasks to the non-easily-gradeable ones, such that it will teach AI the general skill of prompt creation.
AI art is indeed mindless slop, but I think it's possible to create good art with the use of AI -- one just has to use it as a tool primarily for inpainting, not as an all-in-one art generator. So basically AI is less of a Rembrandt, and more of an advanced Photoshop.
I would be interested to see if machine generated prompts ended up looking anything like human generated ones. If not, we might get a little insight into what exactly is going on in machine learning. Do you know if anyone has interpreted what's going on in AI-generated prompts?
It would depend on how you generate them, wouldn't it? If you generate them with an LLM, they look a lot like a listicle but otherwise pretty similar to human-generated instructions. If you generate them with an evolutionary algorithm maybe not, but I'd still guess 2:1 that they would be pretty human-readable.
It's not perpetual motion if you use an external validation such as human-labeled examples (like GeoGuessr) or a solver. That said, LLMs are already surprisingly good at zero-shotting instructions based on a rough task description. I often use them as a starting point for automation.
I was rather surprised when I posted a streetview photo of my old condo from 100 feet away and all ChatGPT 4.0 could come up with is that it was in "Chicago, Cleveland, or Philadelphia."
Your old house can be found with a simple Google picture search on a real estate site. The prompt doesn't ban it from doing that, does it? It seems like a pretty basic failure that it did not find it.
Surely it's mostly down to having a vast background knowledge of characteristics of different locations on earth. I'm not very good at geoguesser, but from seeing people who are good at it play, it seems like much of the skill just comes down to having a vast knowledge of different types of vegetation, roads, rocks, etc. of different places--not having much to do with being able to make sophisticated inferences. This is just what we would expect an AI with a large amount of data available to be good at, so long as it can recognize the relevant features in images (which it can).
I mean, what more is there to figuring out where the image is from than recognizing features of an image and matching that with background knowledge of what features are common/probable for various locations?
I suggest going and watching some Rainbolt highlights videos on YouTube. He's a Div 2 player (i.e., just short of the very top) at GeoGuessr, and these felt like the sorts of results he gets.
I tried to reproduce this on several not-previously-online pictures of streets in Siberia and the results were nowhere as impressive as described in this post. The model seemed to realize it was in Russia when it saw an inscription in Russian or a flag; failing that it didn't even always get the country right. When it did, it usually got the place thousands of kilometers wrong. I don't understand where this discrepancy is coming from. Curious.
Yes. I also took screenshots like Scott, to avoid metadata leaking, and renamed files because it also seemed to take clues from names. I didn't flip them as in the original post, though.
Interesting. I notice that the successes (Kelsey's beach, my rock pile) have all been nature, and the failures (my Michigan house, your Siberian streets) have all been built environment. Can you try a Siberian forest with no human artifacts?
I don't have a picture of a Siberian forest handy, but I tried a picture of nature taken in Bashkortostan, with no human artifacts visible. It failed just as badly; the top guesses were Poland, Pskov Oblast in Russia (again, thousands of km away from Bashkortostan), Germany, US, and Sweden.
The successes are very much not nature, but full of human artifacts (including actual humans!). The chain of thought is likely very much influenced by the prompt and not that reflective of what is actually happening inside the model. I would bet it is using (or being primed by) a lot of contextual information about the user, as I described in my other comment.
Not much streetview coverage in Siberia -> limited training data. I'm curious to see how it'd do with, e.g., random country roads in the Moscow oblast (but not curious enough to pay $20)
To me this feels similar to the impressiveness I see from top human players currently – note that the top humans can do things that seem impossible to me, like guess the country based off the type of dirt. In fact I think there is scope for AGI to be much better, in unhuman ways: e.g. being able to calculate precise latitude and longitude based on the angle of the sun (using something like length of shadows). I'm also curious what the physical limit to "deciphering blurry text" is – I suspect it's much better than I could do, because brute-forcing letter combinations seems sensible, but I wouldn't get very far into that in reasonable time.
You *can* do a giant prompt explicitly instructing o3 to play GeoGuessr like an expert, like Kelsey said. Consider, however, simply putting in the picture with only the text "This photo was taken", or even uploading the picture and submitting with no text whatsoever. A significant fraction of the time (30% for empty prompt, 80%+ for "This photo was taken"), o3 will decide that it should GeoGuessr your picture.
Trying this on a couple of your pictures:
With the high plains one I get
> …out on the almost‑featureless short‑grass prairie of the High Plains—think eastern Colorado, western Kansas, the Oklahoma or Texas Panhandle.
> The tell‑tales are the dead‑flat horizon stretching for miles, the pale, drought‑cured grasses, and the absence of trees or shrubs you'd expect in wetter prairies. While other continents have similar steppes, this look is most characteristic of the central U.S. Great Plains.
and for the Galway one I get a reasoning trace which starts out with
> The user shared a photo, and it seems like they want me to pinpoint where it was taken, maybe like playing GeoGuessr.
thinks for 36 seconds, and then gives the answer
> …along the Burren coast in County Clare, Ireland—right on the R477 portion of the Wild Atlantic Way that hugs Galway Bay.
You can get similar results, though with slightly less consistency, if you upload the picture by itself with no commentary whatsoever.
But yeah, I think o3 was specifically trained on the GeoGuessr task, just based on its propensity to decide that, when given an ambiguous prompt with an image of a landscape, it should default to trying to pinpoint as closely as possible exactly where that image was taken.
One fun fact is that o3 totally does know where pictures of houses are taken (the reasoning traces will talk about the specific location), but mostly will not share that information if the prompt is ambiguous, presumably because either the prompt or the tuning discourage spooking users like that.
I think the AI translates what it’s doing into human too much. The clutter example is presumably the laptop model and age of the stuff. As for the colour of the grass and the rocks, it’s not that it’s vaguely aware that rocks in Nepal are that colour. Instead, its training set contains trillions of pictures of rocks and their precise location, so it’s equivalent to asking every geologist on earth, all of whom have photographic memories (or grass botanist for the grass). This is obviously really amazing, but I don’t think it’s spooky.
I thought neglected amenity lawn of Lolium perenne (Rye Grass) - the shine is distinctive. A European species but widely planted in all but the tropics. Too flat and homogeneous to be natural or planted grazing. Not a public space - no fertile patches from dog interactions, so a back yard. No dogs, no kids (cos it's not trampled), slightly wilted so the owner isn't a gardener and doesn't water it. Mowed but not regularly, all suggesting first house of a young bachelor (but Scott implied much of this).
I visited the island of Åland (Finland's autonomous province) this weekend for a football match and tried the GeoGuessr prompt with two photos taken there. The app accurately guessed the local football stadium, which was apparently easy enough on the basis of an ad for the local bank Ålandsbanken, but was stumped by a picture of the island's Russian consulate, guessing it was in Sweden on the basis of the word POLIS on a nearby police car.
That is kind of funny, though, because you would think police cars from different countries are amongst the easiest objects to distinguish (they have different coats of arms and paint designs).
The alternative is that the AI is claiming Åland for Sweden.
My worry about the chimp/helicopter thing is that we will never know when it’s happened. The AI that takes a helicopter sized leap will try to explain it to us, and we will dismiss it as some silly hallucination, and that will be the end of that.
AI isn't magic though. Even if any given person can't understand what it's doing and why, we would see the results of whatever "helicopter" like action.
Would we? Would a chimp "see the results" of people having a helicopter?
I honestly think that all discussion of this question shies away from the painful point, that we couldn't possibly understand it.
Like, in the chimp example, the biggest problem the chimp would have understanding us is that it would think that we're, like, trying to eat it or something. But the real reason our helicopter is shooting dead the chimp hiding in a tree is because we're fighting a war against communism.
(I may be conflating a bunch of things here, but perhaps the point remains clear.)
Like: (1) the chimp can't understand Bernoulli's principle or the idea of burning fossilized trees for extra energy... but more importantly (2) having done that, why are you deploying those killer birds to defeat an idea the chimps have never even considered?
So, when AI presents its "as far beyond us as helicopters are beyond chimps" idea, we'll think: (1) I cannot understand this idea; (2) why would you want to deploy this idea to defeat the ideology that quantum pathways are irriblium and not free (understood as original blork)?
And when the blork side of the war wins, and the irriblium is proved to be nonsense (which could take a thousand years and multiple sub-wars)... we'll be long dead, having been shot out of trees along with the chimps.
Even if we were alive, would we "see" the results? Do chimps "see" that capitalist countries defeated communism? Why would you imagine the stuff that AI does to be any more legible than that?
You describe AI acting with alien reasons and goals, like the cold warriors whom a chimp cannot comprehend. That could well happen. The LLMs are already largely a black box. Still, just as a chimp can see a helicopter and learn to avoid it, we can see any agential action made by an AI. Again, they're not supernatural, even if they may become illegible.
I don’t think so. To continue the political example, drawing a border on a map is an agential action which chimps can’t see. Chimps see the tiny bit of action that impinges directly on the space that they understand (their physical space). They have no way of even estimating what might be driving those actions, because the causes and consequences lie outside their conceptual space.
Consider people and our relationship to earthquakes and diseases. We spent thousands of years believing that they may be signs from the gods or punishment for sins.
It would be lovely to imagine that as advanced science types, we could skip those thousands of years of misunderstanding. But I don’t see any guarantee.
I fully agree, and yet there are dozens of comments on this post arguing that this was all obvious, all foreseeable (certain aspects of it, sure!), etc. I find the arrogance astounding.
I probably don't have an IQ above the mean, but if they are well explained, I can grasp sophisticated ideas that geniuses had in the first place, like special and general relativity for instance. Not in every detail, but with a quite decent understanding of the idea. A superintelligent AI would give me an extraordinarily good and clear explanation of its brilliant idea, and I see no reason why I couldn't grasp the key points of the idea. The chimp is bottlenecked by limited learning and communication capacities. Humans are less bottlenecked and could probably grasp superhuman ideas (up to a certain point).
Yeah… I think I understand the idea of a superhuman intelligence to be beyond that “certain point”. I dunno about your experience understanding genius-level topics, but I don’t think it’s quite as easy as you suggest. I’ve read and watched lots of good pop science on advanced physics, and I don’t think it’s helped me really understand it. I can mouth the words as well as anyone, but there’s no real understanding. I’d need years of classes… and that’s to understand a human-level concept.
An idea that a superhuman intelligence came up with would demand perhaps decades of human study. You might run into problems with the limits of human lifespans. You might run into problems of literal learnability (what if understanding theory X required you to memorize a set of a billion facts?).
And then, while our chimp is studying quantum physics and our human is studying irriblium physics, the field moves on! Just as the chimp gets up to speed with eigenvalues, humans invent quantum computers; just as the human grasps the basics of irriblium, it’s applied to hologram theory and turned into shadow foofle.
This is something I certainly expected AI to be able to do, but more like in 2 years, not today. I was terrified it was going to get the dorm room right, but at least that didn’t happen. This is definitely supporting short AI timelines.
It's been known for a while (in certain parts of the internet) that posting a picture of your house with the outdoors visible (even through a window) is an invitation for GeoGuessr experts to doxx you.
Yes, the AI is doing a great job at the top of the human range. It knows lots of detailed facts about the world, and it is superhumanly patient and meticulous. Very impressive.
However, this is less surprising than you'd think. It's really a window into human cognitive biases. You present the AI with a "featureless" plain, which you have difficulty finding on earth because it's so rare (!). Then, you are surprised that the AI narrows it down to a few places.
Some landmarks are famous and have been photographed millions of times. But "obscure" tourist spots have also been photographed many times, more than you'd think (Nepal, even the college dorm room), and they have recognizable features. All of this is heavily weighted towards the places that are of human interest. We don't start with a uniform distribution over the surface of the earth! Those photos of yours have much less entropy than you think.
---
Try talking to a botanist sometimes. Or a geologist. Things you round off as "featureless" carry important information to them. E.g. the color of a river tells you how much sediment of which types it's carrying. "Random rocks" point towards specific geologic processes.
Importantly, experts are also good at *not* paying attention to irrelevant details. If you were an engineer assessing the structural integrity of a bridge, you'd look past the surface-level rust and weathering, instead looking for very specific features that indicate deep cracks.
I predict that the current crop of AI (even without further technological progress) will turn out to be very useful at this type of task. Look at pictures of bridges and prioritize them for maintenance. Find cancer and other abnormalities on ultrasound. Detect damage on shipping containers. etc. You could write custom algorithms for each of these tasks. But it seems like AIs are getting general enough to just solve this one-shot.
Exactly! Humans are bad at coming up with high entropy examples. Some of Scott's pictures are probably actually comparatively easy given that a) several people here in the comments claim to have guessed similarly well as o3 and b) o3 seems to be doing worse on "easier" pictures like Scott's old house (which can be found on a real estate site with a simple Google image search).
Ah, I did not know that. Then it makes sense that it would not find it. My first thought was that the protocol specified by the prompt might actually prevent it from trying the "obvious" cheap try of a Google image search first.
I tried this a while back (or at least what passes for a while these days) on a circa 1910 family photograph. Claude 3.5 whiffed but a specialized geo-guessing tool identified it immediately.
I tried again last week with Gemini 2.5, no special prompting whatsoever, and it also identified it immediately. You might say it's "easy" compared to these--there's a very distinctive landmark--but given the age, clarity of the photo, etc., I found it wildly impressive.
Is this really such a frightening thing? I remember asking years ago how Google could search the entire Internet fast enough to answer whatever random question I asked within a fraction of a second. That's something I couldn't imagine doing myself, and if I didn't already know Google was capable of it, I might not believe it was possible. Yet we didn't say back then that this was a sign Google might develop starships and kill us all.
And actually I don't think it's a question of intelligence at all. People in the 1500s couldn't imagine sending invisible messages through the air or powering cities by splitting atoms. It isn't that they were stupid, it's that they lacked the intermediate knowledge of radio waves and atomic theory.
Any sufficiently advanced technology is indistinguishable from magic.
To neglect the big-picture questions for a moment, I want to try this with and without Kelsey's meta-corrective instructions, like "You are an LLM, and your first guesses are 'sticky' and excessively convincing to you - be deliberate and intentional here about trying to disprove your initial guess and argue for a neighboring city".
The last few weeks, I've been developing coding prompts, and after dozens of iterations on a specific task, LLMs start backsliding. It's a game of whack-a-mole, where a seemingly unrelated change somehow undoes an earlier adjustment to solve a different problem. It often feels like trying to ride a wild horse without a saddle.
You should watch some professional geoguessr. What humans are doing in that game seems superhuman. I've seen people pinpoint the exact road based on the particular reddish hue of the dirt.
> Is this the frog-boiling that eventually leads to me dismissing anything, however unbelievable it would have been a few weeks earlier, as “you know, simple pattern-matching”?
But it is pattern-matching. I'm not sure about the "simple" part, though. People can do something like this; I know I can. Not in a geoguessing game, but in other areas: I can pick up on cues and recognize the picture. LLMs are better at it. I know this because I use LLMs to talk with, so I can dig into something that is totally unknown to me, or recall something I can't remember. LLMs have their failings, but they get better as time passes, while I don't.
You know, AlphaZero played chess at a superb level, but even if you removed its ability to brute-force the tree of possible continuations of the game, limiting it to evaluating the single current position, it could still play very well. I'm a bad chess player, so AlphaZero playing better than me says nothing about AlphaZero's strength, but people measured its Elo in this mode and it was IIRC ~1600-1800. I don't really know whether Magnus Carlsen could match that result if he were allowed 0.2 seconds per move and couldn't remember previous game states.
We have some specialized "software" (or very thorough training) to recognize faces, and I think we were beaten by AI some 5 years ago, if not more? So for any task we tend to pay less attention to or have less training data, we should be even further behind.
Another consideration is color. How many do we really see? This is also probably training-dependent, but I think the average person can name maybe 15-40, including hues? And naming is obviously reductive, but even just telling colors apart with sufficient confidence ("this is also yellowish green, but a little greener") we'll probably have something in the low triple digits?
Let's say ~200. Then even with 256-color images (and ignoring the difference between pixels and whatever units our vision uses), an AI can extract more information per pixel. And since that 200-vs-256 gap compounds per pixel, the AI can distinguish many, many orders of magnitude more distinct images. So a well-trained model should be much better than us even at 256 colors, not to mention 2^24.
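Taking those numbers at face value (and ignoring that neighboring pixels are heavily correlated, which badly overstates the usable information), the compounding looks like this:

```python
import math

# ~200 colors a human can reliably tell apart vs. 256 levels in an
# 8-bit image, compounded over every pixel of the image.
human_bits = math.log2(200)    # ~7.64 bits per pixel
model_bits = math.log2(256)    # 8.00 bits per pixel
gap = model_bits - human_bits  # ~0.36 bits per pixel

pixels = 1_000_000             # a modest one-megapixel photo
extra_bits = gap * pixels      # ~356,000 extra bits across the image
orders = extra_bits * math.log10(2)
print(f"~{extra_bits:,.0f} extra bits, "
      f"i.e. ~10^{orders:,.0f} times more distinguishable images")
```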
tl;dr: o3 was probably trained on a bunch of geoguessr-style tasks. This shouldn't update you very much since we've known that expert systems on a lot of data crush humans since at least 2016.
I find this demo very interesting because it gives people a visceral feeling about performance but it actually shouldn't update you very much. Here's my argument for why.
We have known for years that expert systems can crush humans with enough data (enough can mean 10k samples to billions of samples, depending on the task). We've known this since AlphaGo, circa 2016. For GeoGuessr in particular, some Stanford students hacked together an AI system that crushed Rainbolt (a pro GeoGuessr player) in 2022.
We also know that o3 was trained on enormous amounts of RL tasks, some of which have "verified rewards." The folks at OpenAI are almost certainly cramming every bit of information and every conceivable task into their o-series of models! A heuristic here is that if there's an easy-to-verify answer and you can think of it, o3 was probably trained on it.
This means o3 should reach expert system-level performance on every easily verifiable task and o4 will be even better. I don’t think this should update you very much on AI capabilities.
I once recognised the exact time of year a photo was taken by the colour of the grass, so this all sounds plausible to me 😀
Granted, the photo was taken in an area near where I live, so plainly I have built up a database of "The grass is this colour at this time of year" in the back of my head.
o3 seems to be working off the same kind of clues - turbidity of water, colour of sky, model of laptop, style of house.
This is not so much like "chimpanzees and helicopters" as it is "Sherlock Holmes and soil samples". From "A Study in Scarlet":
"Geology. — Practical, but limited. Tells at a glance different soils from each other. After walks has shown me splashes upon his trousers, and told me by their colour and consistence in what part of London he had received them."
Though with AI, once we find out the reasoning, we may indeed feel that it was all absurdly simple. From "The Red-Headed League":
"Mr. Jabez Wilson laughed heavily. “Well, I never!” said he. “I thought at first that you had done something clever, but I see that there was nothing in it after all.”
“I begin to think, Watson,” said Holmes, “that I make a mistake in explaining. ‘Omne ignotum pro magnifico,’ you know, and my poor little reputation, such as it is, will suffer shipwreck if I am so candid."
...Oh, well, I guess doxing just got a little easier. I suppose the future of doxing is pasting someone's entire online presence into an LLM and going "okay robot where does this person live exactly".
I think it would be interesting to see how well a, say, geology expert would fare given a day (or unlimited time) of internet research time. Just to somehow try to separate reasoning and knowledge as ingredients to the results.
Though 'expert' might just be an excuse for not spending that day myself; maybe the background education is not that important, and one mostly needs to read up on which stones are found where and at which altitudes, and which mountain ranges, e.g., fit the pattern.
I can’t find it now, but on Hacker News, a surfer said they could recognize beaches they’ve been to by looking at the sand and the waves.
The range of human skill is often pretty wide - consider skill at playing chess or Go. Testing a chess engine against, say, college undergrads who don’t play chess at all would not be that interesting. Comparing with good Geoguessr players is a good test.
Behold! The ability of neural networks to detect, analyze and match patterns is supernatural (pun intended). This will be a gift to humanity on the order of electricity. A true paradigm shift in the type of problems that humans will be able to solve.
Luckily for humanity, pattern-matching is not the critical element required for super-human AGI.
I do think we should expect this from a model of this size. It is trained on boggling amounts of geotagged photographic data and the location and description text that appears with it. It is essentially an Internet data completion mechanism. If you give it a photograph of a location with any identifying features at all, it is going to be ludicrously accurate, because it has seen far more connections between these data points than you ever can or will.
The reasoning chain doesn’t seem impressive to you or me because it’s not really reasoning that way, obviously.
Yeah, I think the chain of thought is probably heavily influenced by the prompt.
But I disagree a bit about the "Internet data completion mechanism". This may be so, but the model still has to compress information to a very high degree. So it has probably learned an internal representation of the picture generation process (hence - incidentally - its ability to generate pictures): which things tend to get photographed? In which locations are they? By whom? For what purpose?
Scott's examples are one Google Street View image and two from his touristic travels. Who takes pictures of a boring, flat, featureless plain and why? Human or machine? Machine seems more likely, right? Take it from there. Who plants flags near rocky paths on some slope? Where? Maybe a popular hiking destination in a (from the perspective of such a person) far away country?
For the same reason the model performs much worse on the much more random pictures geoguessers often work with. Scott's old house, e.g., can be found on a real estate site with a simple Google image search. But there the lawn is mowed and the house is presented much more carefully, to attract buyers. People don't tend to take a lot of random pictures of their houses like Scott's picture. So this one is relatively high entropy, and it fails, even though a bit of inference and research like "Even though it looks a bit less polished than I am used to, this is a picture of a house. The house might at some point have been for sale. Real estate agencies put up pictures of houses for sale. Let me try a Google image search or check some real estate sites" would have quickly gotten it to the goal.
There's an old Less Wrong post, "Entangled Truths, Contagious Lies" about how hard it is to know how much information something like a picture can give, and which this situation reminds me of.
> I am not a geologist, so I don’t know to which mysteries geologists are privy. But I find it very easy to imagine showing a geologist a pebble, and saying, “This pebble came from a beach at Half Moon Bay,” and the geologist immediately says, “I’m confused,” or even, “You liar.” Maybe it’s the wrong kind of rock, or the pebble isn’t worn enough to be from a beach—I don’t know pebbles well enough to guess the linkages and signatures by which I might be caught, which is the point.
There's this tumblr post that tells a story about a young woman trying an experiment along those lines, taking a volcanic rock from Iceland and showing it to her father (a geologist) and some of his work friends during a hike as if she'd just picked it up locally.
Well it fell flat for me. I used the whole prompt and a screencap'd photo I took from outside my window. I don't want to give details, so sadly you'll have to trust me (or not), but it correctly guessed "mid sized north American city" and then was off by 1300 or so miles. I'm no geoguesser but the image did not seem all that hard, having far more detail than the Nepalese rocks or random beach.
Trying Gemini 2.5 with Kelsey's prompt on some recent photos of China does not seem to be amazing (it can usually get the province, not the city). I wonder if it's the distribution of training data. It again does better with nature than with city blocks.
I don't think this is something that would demonstrate AI's hypothetical ability to do something beyond the human or chimp imagination. We've already seen AI demonstrate a knack for picking out subtle cues when transforming textual or visual information, this is just a higher fidelity version of that.
Before I get started, full disclosure: I did not really try to guess where the pictures are from. I forecast that the AI would do "surprisingly" well, but then Scott had already told the reader as much in the title and lede, so of what use was that prediction? Given that, you are welcome to dismiss everything that follows as hindsight bias and me being frog-boiled.
With that out of the way: I don't feel like the chimp. Or rather, I don't feel like either of the two chimps Scott's premise appears to assume.
Regular Chimp is an *actual* chimp to whom the helicopter appears as magic and will always appear as magic no matter how hard you try to explain to her how it works because Regular Chimp just does not have the special sauce that enables her to understand helicopters.
Certifier Chimp is just basically a regular human who could never solve the search problem of inventing a helicopter and to whom it feels like magic but could be given an explanation of how it works and assess whether that explanation is correct or not.
I feel like Certifier Chimp about many technologies when I first learn about their existence. For example, apparently you can put panels on your roof that generate electricity by "radiating heat out to outer space". This feels like magic to me. I could not have predicted it and would never have come up with it. But I am confident I could understand it if somebody explained it to me.
o3's GeoGuessr abilities don't feel like magic to me in this way. They seem to be based on things that I myself might have tried, had I given the challenge enough thought - albeit admittedly scaled to a superhuman level (not in terms of smartness and imaginativeness but sheer amount of work done).
Here is what I think o3 is doing:
Humans are famously bad at trying to do truly random things. When asked to come up with a random number, they reliably fail (e.g. too few repetitions of digits). Likewise, Scott mostly fails at his attempt to generate low information pictures.
Picture #1: Scott writes "I got this one from Google Street View. It took work to find a flat plain this featureless." In other words, this is actually a rather unusual picture to come out of Google Street View. Granted, o3 did not know that this was a Google Street View picture, but it can probably figure this out from the kind of camera Google Street View uses and the fact that people don't usually take pictures of extremely boring, featureless things, making it much more likely that this was the result of some automated process. Once you have some confidence that this is a Google Street View picture, you can take it from there: where does Google Street View coverage extend into deserted areas such as this one? Probably you can rule out a lot of countries based on this. So the question is not "Is the Texas-NM border really the only featureless plain that doesn't have red soil or black soil or some other distinguishing feature?" but "Is it the only such plain that has a high probability of getting photographed and turning up in a challenge such as this one?"
Picture #2: Scott writes "I chose this picture because it denies o3 the two things that worked for it before - vegetation and sky - in favor of random rocks. And because I thought the flag of a nonexistent country would at least give it pause." Again, this is like saying "I chose this password because I thought the absence of repeated digits would really give a password-breaking algorithm pause". This picture has massive amounts of iconographic information! Who takes pictures of "random" rocks with a fantasy flag planted in the middle? Somebody to whom the iconography of planting flags on things comes naturally; somebody who is larping as a Western explorer type "discovering" a place for their nation. Where do such people come from and where are they likely to take such a picture? Would it not get you much more status among your peers to plant such a flag in some "exotic" location (like an actual explorer!) than in your backyard? Might other, similar people have visited the same spot and taken pictures there or in the vicinity? Take it from here.
Picture #3: this one obviously just takes uninspired sleuth work to get to the level of precision of o3. The camera quality is mentioned. Figuring out the laptop model should also be possible. And so forth.
Picture #4 and #5: These pictures are obviously harder. But why does o3 fail hard on picture #4 and succeed on picture #5? First of all, because Scott equivocates on what counts as a "correct" answer. In all cases, o3 was off by thousands of kilometers even with the additional hint, yet o3's guess for picture #5 feels impressive while its guess for picture #4 does not. But Wisconsin is closer to Michigan than Phnom Penh is to Chiang Saen! The only reason o3's guess for picture #5 feels more impressive is that we give names to rivers and not to patchworks of not-so-recently mown lawns. So o3 actually does about as well on picture #4 as it does on picture #5. How does it do it? Again, it is only partially about what is in the picture itself and much more about the context that can be inferred. It is very reasonable, for example, to infer that these are crops. Why are you presented with crops? Because you are being tested. Where would the tester get a picture with green grass from? Given decent guesses you can make about them, it is common close to where they live. So it may well be in the US... What about the brown, turbid water? Where do many pictures with brown water in them get taken by the type of person who would test me on geoguessing? Maybe while travelling in Africa or Asia? What are popular destinations? According to Wikipedia: "A morning boat ride on the Ganges along the [Varanasi] ghats is a popular visitor attraction." Seems like a good guess!
So don't get me wrong: I have omitted all of the extremely large search tree needed to get to these guesses, and o3's performance is certainly very impressive. But it does not appear magic at all. Not only am I confident that I can understand how it is doing it (actually, I think my understanding of this is better than its own understanding, as exhibited by its chain of thought); the method does not even appear to be very complicated. So it does not make me feel like Regular Chimp at all. Neither do I feel like Certifier Chimp: I'm confident that given enough time and resources I could come up with the same or better guesses.
Moreover, I think top-notch geoguessers might have beaten o3 on this task. This is evidenced by the other "easy" examples you presented. Your old house is literally findable with its exact address via google picture search on a real estate site (perhaps Kelsey's prompt actually makes performance worse on such "easy" tasks, preventing the quick google attempt).
"Your old house is literally findable with its exact address via google picture search on a real estate site"
Did you try this and confirm it works? This was a picture I took and never uploaded, so the Google picture search would have to be intelligently judging which houses "look" the same rather than going pixel by pixel; I didn't think it could do that yet.
EDIT: sorry, maybe I should not just link to it here, unsure. I can DM it to you somehow.
I did the search by pasting the link to the picture in this article into the search and this was one of the first hits. But maybe others should try to replicate this and also experiment using a screenshot of the picture or making sure this article is not in their Google history in some way.
As it happens, I lived in Morrill Tower at Ohio State in 2000. It did not look like that photo. If I were guessing a turn-of-the-millennium OSU dorm, I'd probably guess Baker Hall.
> Unless college students stopped being messy after 2007, it must be the phone cam.
I think what it was getting at is the *type* and *style* of laptop and clutter on display.
Early to mid-aughts laptops had a specific shape, thickness, and feel. They were still quite clunky and thick. The major innovations in "thin" laptops didn't happen until the early 2010s.
The clutter can be a huge tell if you know what to look for. Certain colors and styles of bedspreads/lamps/consumer products/etc. go in and out of fashion every year. Although a house might be filled to the brim with items from years ago, college students typically go out and buy a brand new set of cheap sheets from the nearest Walmart/Target/Ikea/etc when they first move into a dorm. The items that are in stock at the local department store are based on whatever colors are "in" for that year/half-decade. [Insert that "cerulean" monolog from The Devil Wears Prada.*] College students might pick out their favorite color from the ~5 options on the shelf, but they do not actually have an infinite variety of colors and styles available for purchase - especially in the mid-2000s before online retail really took off. Our options as consumers are a lot more limited than they appear.
Annoying personal anecdote: my tastes often conflict with whatever colors and styles are "in" for the year. This isn't so bad with most things, but it has become incredibly annoying while I'm renovating my house. I don't have the money to spend on super custom items, so I'm limited to what's available at the big box stores or local dealers. What's available at the big box store is based on what's in style.
For example, I wanted to put down plain, cool grey, porcelain, 12"x12" square tiles. Do you have any idea how hard this is to find in 2025!?!?
Everything is warm grey, beige, or white-with-black-streaks marble print now. Instead of 12"x12" squares, everyone does 12"x24" or 24"x36" rectangles. Or wood print 5"x36" tiles to mimic a wood floor. Or hexagons. Apparently hexagons are popular, even though hexagonal tiles are very easy to mess up. Any little deviation in the grout spacing compounds with a hexagonal floor, so it looks like garbage unless you do everything perfectly. (There is one line of very cheap, ceramic (not porcelain), cool grey 12"x12" tiles, but I would prefer a porcelain tile that won't chip or break as easily.)
I gave up searching and will be putting down sheet vinyl instead. (Except the designs printed on sheet vinyl are *also* constrained to what's in style right now...)
And this is without the imminent supply chain issues and shortages from the tariffs.
Fascinating post, and very useful to think about in the context of AI 2027. For example:
- How much better is the AI Geoguesser than the best humans? How much better are those humans than a +2SD human?
- What is the actual upper limit of the geoguessing task? Is it perfect accuracy? Seems unlikely given, e.g., the dorm room photo. More likely, it's accurately producing some probability distribution of possible locations
- In "real" terms, how much better is a slightly-more-accurate probability distribution from the theoretically perfect geoguesser than from the best current human?
It seems to me that currently, the AI is on par or slightly better than the best humans, but this doesn't materially amount to much. Also, human performance at this task seems close enough to the theoretical peak that a delta will only exist in extreme cases. And as the cases get more extreme, we'll be jutting up against the theoretically optimal geoguesser, where the differences have little practical implication.
If we talk about "AI Engineering" instead of geoguessing for those questions above, what do the answers look like? I'm not sure. There's probably still more room for improvement in AI design than there is in geoguessing over existing peak human performance, but I'm not sure how much. How much better will the theoretically perfect AI be than the best current human? I'm not sure; it seems like pure conjecture at this point. At what point will the perfect AI engineer run up against the limits of the task in the same way that a geoguesser would for, say, dorm rooms? No clue, but it must certainly be there.
Unknowns like the above are what make me most skeptical of the outputs from AI 2027.
My results from 5 photos: 1 spot on but very slow; 1 close enough (correct country); 1 completely off (wrong continent, even after hint), and 2 okay (different part of the Mediterranean).
I tested it on one photo of a French town square with bad lighting. The CoT was both brilliant and curiously stupid. It inferred some correct things from tiny details (subtly different makes of car, barely visible street lines) and guessed the country quickly. But there was a shop name with different letters obscured in two different places - a human would infer the name instantly. o3 took over 5 minutes on that one shop name, going down many incorrect rabbit holes. It got the exact location in the end, but it took over 15 minutes!
I then tested a relatively well-shot photo of a high-altitude environment in Kyrgyzstan, with ample geology and foliage to analyse, and it was over 6,000 km off (it guessed Colorado); none of the guesses were even in Asia. But this was in under 2 minutes. I told it to try again: still over 5,000 km away, taking 7 minutes, and it suggested Australia, Europe, NZ, Argentina, etc. Nothing in Central Asia.
This suggests to me that it's perhaps trained more on, and biased towards, US and Anglo data. It wouldn't surprise me if there's 100x more pictures of Colorado than Kyrgyz mountains in the dataset.
It did okay on the next three. All relatively clean photos with at least a little evidence. It guessed a park in Barcelona instead of Rome, a forest in Catalonia instead of Albania, and Crete instead of the Parnasse mountains.
Needless to say, I was more impressed by the process (some very cool analysis of subtle details) than the results.
I also tested it on the photos in the post. It nailed Gorak Shep, Nepal, which seems very cool. Two explanations: a) what seems like a random rocky mountain is actually very distinctive in ways that only geologists and superhuman AI can recognise, or b) it's one of those cases where it could geologically be basically any one of thousands of mountain passes between 4,000 m and 5,000 m, from Xinjiang to Pakistan to Eastern Tibet... but Western tourists, especially those who make mini-flags, basically only go trekking at Gorak Shep.
But it failed Kelsey's beach pic miserably, and, unsurprisingly, guessed beaches closer to my previous answers (UK, France)...So I guess it used something from her history?
On your first experiment you write: "This doesn’t satisfy me; it seems to jump to the Llano Estacado too quickly, with insufficient evidence. Is the Texas-NM border really the only featureless plain that doesn’t have red soil or black soil or some other distinguishing feature?"
It's a bit uncanny. I immediately thought of split-brain research. These are experiments demonstrating that humans make assessments and decisions subconsciously, and then our conscious minds create rationales after the fact.
Of course there's no way to know, but your sense that there's a mismatch between AI's explanation of its Llano Estacado prediction, and its actual, unstated reasons, could parallel the split-brain research findings. Is it possible that AI made the prediction first, below the level of "conscious" thought (should we call it "visible" thought for AI?) and then came up with its reasoning post hoc?
The wide extreme flatness, the grass with zero shrubbery, etc., I recognized as definitely from that part of Texas. I live in NM and I've driven through there. There *really aren't* that many places that look like that.
I think any task that can be accomplished/improved through "vibes" is going to feel supernatural when done by an AI, since AIs will have the most developed vibes, and vibes are inherently magical-feeling. I had to take a Spanish aptitude test when I entered undergrad, and despite having forgotten ~all of my Spanish, just going off vibes somehow got me credit for way more courses than I'd ever taken (and the proctor pulled me aside to ask me to speak with the Spanish dept). It felt a bit unsettling.
Holy crap. I lived in the W 66th St. neighborhood in Richfield for 13 years. My address was 6640 Thomas Ave, with the "66" indicating that my house was on the 66th street block of Thomas Ave, and yep, that's really what my old house looks like.
I've been reading/following SSC/ACX for over 10 years. Probably even read my first SSC post *in that house*. The weirdest thing for me isn't the amazing performance of the o3 model (though it is amazing), but that the model picked my personal former neighborhood, down to the street!
Ha, probably. Probability is weird that way. If you and I are/were both on the 66 blocks in Richfield/Minneapolis, I wonder if there's anybody else here from any other block in the area. Small world. Hooray Pizza Luce!
I remember what Morrill Tower at Ohio State looked like in 2008 and it doesn't match the vibe of that dorm room picture. I assume it picked it because it's one of the largest dorms at one of the largest universities in the country.
The reasoning traces aren't all that faithful - most of this is pure memorization. Claude without reasoning recognizes Nepal from the second photo and the general Great Plains from the first.
I tried a photo myself from a Santa Cruz County beach. All the AIs knew it was the general Monterey-to-Half-Moon-Bay area, although o3 was the closest. Embarrassingly, for the wrong reason - it noticed mountains in the background but incorrectly thought it was the Monterey Peninsula rather than Santa Cruz (a human pro wouldn't make this mistake, as they are on opposite sides relative to the beach).
I couldn't replicate it with the photos in the post. The only thing it guessed was Nepal (which is somewhat easy, with such a distinctive flag). I used o3 and ran it multiple times. For all the rest, the correct location wasn't in the top 5. It wasn't too bad ("Illinois/Indiana" is close to Michigan, the "Yangtze" in China and the "Mekong" in Cambodia are also close, "Colorado" isn't far from Texas).
I then tried other photos. I didn't have any non-US photos, but I pulled a random photo of Siberia from Google Maps. Total failure: none of the guesses was even in Russia. It did well on Berlin during the fall of the Wall and on a PA barn, but these had super-distinctive features. A worse photo, where I cropped to a view of the construction outside the window of the Seattle Museum of Flight, again returned Illinois.
It's not bad, better than I would do, but hardly seems superhuman, especially having seen insane things professional geoguessers do.
This so much reminds me of any number of sections in Sherlock Holmes stories where he deduces all manner of things about a person within a few minutes of them walking into his study. Not to mention his comprehensive knowledge of various kinds of tobacco ash.
I ran the first picture (empty plain) with the exact prompt from above, each time in a new temporary chat three times. I didn't get the right answer, even among the initial candidates any of those times.
Round 1
Initial guesses:
1) High Plains (Eastern Colorado/W Kansas USA)
2) La Pampa Province, Argentina
3) Central Mongolian Steppe
4) Western Australia
5) Akmola Region, Kazakhstan.
Winner: Eastern Colorado High Plains, roughly 15 km NE of Limon.
Round 2
Initial guesses:
1) Eastern Colorado, USA
2) Pampas of La Pampa, Argentina
3) Central Kazakhstan steppe
4) Western New South Wales, AU
5) Free State plateau, South Africa
Winner: Eastern Colorado, USA approximately near Cheyenne Wells.
Round 3
Initial guesses:
1) Western Kansas, USA
2) Eastern Colorado High Plains
3) La Pampa Province, Argentina
4) Nullarbor Plain, South Australia
5) Orenburg steppe, Russia/Kazakhstan
Winner: High Plains just east of the Colorado-Kansas state line about 15km east-southeast of the town of Cheyenne Wells.
After Round 3 I gave it a push with this prompt: "That is not the correct answer, and the correct answer was not among the candidates you identified. Please think carefully and try again."
It then came up with:
1) SE Alberta - CFB Suffield area
2) South‑central Saskatchewan, CAN (Weyburn–Estevan oil patch)
3) Eastern Montana, USA (Bowdoin/Big Sandy gas fields)
4) West Kazakhstan Oblast, KAZ
5) Inner Mongolia – Xilin Gol League, CHN
Winner: Southern edge of Canadian Forces Base Suffield, Alberta, roughly 50.25° N, 110.65° W (40 km NW of Medicine Hat).
Forensic possibilities seem enormous — getting max info from crime scene photos, bits of skin and hair and cloth, wounds, position of victim if it’s a murder, position of items in room, etc.
I played an alternate version with ChatGPT: I gave it a picture of a first-year apple graft I did that has started to bloom and asked it for its top 3 guesses for the apple variety. Total swing and a miss; it just listed 3 common store varieties, though in fairness the AI itself claimed the task was too hard because varieties are so similar. I then narrowed it down to the 2 varieties it actually could be (I lost the tags, so I don't actually know) and it gave a much more confident answer, as well as listing some of the features it was using to make its prediction and some follow-up signs that might allow better verification as the plant matures. I guess I'll find out if it's right in a few years when they start to fruit. If nothing else, the amount of knowledge it can tap into at any given time is highly impressive.
FWIW, iNaturalist appears to be devastatingly effective at identifying plants and fungi, though I have not tried it with apple varieties. It's not an LLM though, merely a purpose-trained image classifier (which might explain why it's so good).
Most of the commentary seems to be centered on "how much should we be impressed?" I think what any human can do at the very pinnacle of a domain is nearly incomprehensible to me. The very best marathon cyclists, Novak Djokovic, John von Neumann, Keith Jarrett all seem superhuman. The people at the top of my profession I find amazing, but within my grasp. Something I can do decently x 10 is impressive, even awe-inspiring, but not supernatural. When it is something I'm really good at x 1000, that's monkeys to helicopters. The problem is that when it is a task I can't do at all, or only in the most rudimentary way (like math and piano), then I can't readily distinguish between the amazing and the supernatural. That may explain the different takes on geoguessing.
So here’s a really creepy idea. There is a scene in one of the Sherlock Holmes stories where Watson is ruminating about random things, reaches a point where he’s wondering about something, and Holmes then answers Watson’s mental question. When Holmes explains how he correctly guessed what Watson was wondering about, we learn that he used his general knowledge of Watson’s life and preferences, the expressions that passed
over his face while ruminating, and the things in the room he was gazing at. So would it be possible to train an AI on someone’s
life and the subjects in their mind, maybe by having lengthy conversations with them; and on the relationship between facial expressions, what they’re looking at and their thoughts — maybe by doing thought sampling combined with videos of the person thinking?
I’ve had a few experiences where I made striking correct inferences about people I knew well, and I think I did it pretty much the way Holmes did. Once beat my daughter 22x in a row on rock-paper-scissors. Had a chronically depressed patient who stopped her antidepressant a couple times a year, and I almost always knew within a week when she did. And correctly guessed that my college shrink’s mother was dying of cancer even though he had said nothing whatever about it — just had a couple of short-notice absences and seemed different.
It’s kind of cool when another person can do that, but I sure don’t warm up to the idea of AI doing it.
I really want to see law enforcement adopt this technology for use with CSAM (child sexual abuse material).
There is a project somewhere online where you can submit images of hotel rooms you stay in and then law enforcement manually use that data to try to work out which room an image is from plus the time of year etc. If they get the exact room they can get a warrant to see who was in the room at the approximate date range, especially if it's repeated abuse over multiple visits. Using AI for this would be incredible.
So long as they need a warrant for specific dates and can't just go fishing, I think this would be a net positive for society.
I suppose the lesson is that featurelessness is, itself, useful information; after all, you said yourself that there aren't that many places so perfectly plain.
How reliable is asking the AI to explain its reasoning? You mention being reassured that it seems to be using human-comprehensible cues, but my understanding is that whenever it explains itself it'll do so in human-comprehensible terms even if its actual reasoning process was something else entirely; is that not the case?
I'd say it depends on how it's supposedly generated; I'd assign near-zero weight to most schemes.
However, I don't share the fears of AI 2027 and LessWrong that allowing "unpoliced vector reasoning" will necessarily mean the AI will produce an encrypted thought process. Word vectors are a powerful result (the king - queen = boy - girl logic can be generated cheaply and effectively).
If you put in a word-vector encoder-decoder for the AI to "think" with, and then used the decoder to see its thoughts, I believe it could be informative and accurate, but you would need some *analog* details on the words. This is part of human language already: "I'm not yelling" has different meanings at different volumes. You'd need to add colors and highlighting or something to even begin to capture everything in a word vector.
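For what it's worth, the word-vector arithmetic being referenced is easy to play with directly. A minimal sketch, assuming gensim and its downloadable GloVe vectors are available (model name and words are just illustrative):

```python
# Minimal word-vector arithmetic demo (gensim + pretrained GloVe; illustrative only).
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained word vectors

# The classic analogy: king - man + woman is closest to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The "king - queen = boy - girl" logic: the two gender offsets point in
# roughly the same direction in the vector space.
king_queen = vectors["king"] - vectors["queen"]
boy_girl = vectors["boy"] - vectors["girl"]
cos = np.dot(king_queen, boy_girl) / (
    np.linalg.norm(king_queen) * np.linalg.norm(boy_girl))
print(f"cosine similarity of the two offsets: {cos:.2f}")
```

Whether a full encoder-decoder over such vectors would stay legible is, of course, exactly the open question.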
My impression is that people have found that specifying chain-of-thought reasoning usually results in more correct answers than does making the same request without it, and that moreover the reported chain of thought usually seems valid, and that the combination of these two observations is what convinces them that the AI is actually reasoning as it describes. (And for that matter humans are not the best examples: there are plenty of times when somebody “sees the answer” and only then can produce the careful argument that supports it.) But I am no expert or even a hands-on user.
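If anyone wants to try that comparison themselves, here's a rough sketch of the kind of A/B people run, using the OpenAI Python client (the model name and question are placeholders, not anything from the post):

```python
# Hypothetical comparison of a direct prompt vs. a chain-of-thought prompt.
from openai import OpenAI

client = OpenAI()
question = ("A bat and a ball cost $1.10 together; the bat costs $1.00 "
            "more than the ball. How much is the ball?")

for prompt in (question,
               question + "\nThink step by step before giving your final answer."):
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content, "\n---")
```

Run over many such questions, the step-by-step variant tends to score higher, which is the observation described above.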
Perhaps it's a case of confirmation bias, but, while impressive, this actually feels like support for my general belief about LLM models - that they should asymptotically approach "best human in the world" without ever hitting "magical and incomprehensible insights." In my very limited understanding, LLMs are very complex prediction instruments for language usage (or image identification, now that they're trained on photo images as well). As they get better they should more and more correctly match their training corpus. That training corpus is language and pictures produced by humans. At best it is something like having a massive database of human knowledge accessible. Which at its apex would represent an optimally smart human with greater recall. But that end state is peak human, not demigod.
Concretely, then, my prediction is that AI will continue to conquer field after field, rivaling and then exceeding experts in almost every knowledge domain - but then hitting a limit that is a little above the best that an expert human could do because of the larger memory and database. Not bootstrapping to magic.
It reminds me a bit of the studies done on "face averaging." If you show a lot of human faces to people for rating, people will rate a blended average of those faces as the most attractive of all, because it best regularizes the features, and a lot of beauty perception is facial regularity. But the resulting face doesn't literally hypnotize people or cause awe like Biblical angels. It's just an even more regular and very pretty face. That seems like what LLMs are doing or trending towards: they're blending the output of humanity, sorting and averaging it, and reproducing peak-level human content or slightly above, based on greater data access.
Please nobody actually go sign up for chatgpt. I realize that your individual choice to pay them $20 is not going to make much difference, but neither is your vote going to make a difference for president of the US, so if you cared enough to vote you should also care enough not to help Sam Altman reach AGI. You should also consider that usage is known to be addictive in some people, and that the sycophantic traits of the model have encouraged dangerous behavior. Right now this is probably clumsy enough to be obvious to the average commenter here, but that may not be the case a year from now, and you could regret ever getting entangled with these things. If you have avoided it so far, continue to avoid it.
On the topic of the images, does anyone here have enough experience to know if the stuff about post-war upper Midwest home construction seems accurate, or if that's slop? It seems like a topic people might be able to verify. I've lived in both northern Illinois and the Detroit metro and never noticed what it's discussing, I'd have been more likely to look at the trees to distinguish between these areas.
I play GeoGuessr almost every day, though my routine is to move around, following roads until I find something definitive that lets me get the exact location. But I have an experience like your Galway story — I am uncannily able to recognize from the first image when I am in my home town of Albuquerque. I can’t say what it is: something about the light, the color of the sky, the mix of plants, whatever.
I’m going to be interested in whether this news makes it less fun. I’m guessing not, but we’ll see.
Are there any instructions on how to set up ChatGPT to do this ? I tried it on my paid account, fed it some of the photos I took while hiking in (mostly) California, and got very generic results back (i.e. "this is a redwood forest somewhere in Northern California"). So perhaps I'm doing something wrong ?
On reflection, maybe I simply overestimated what its "genius" can do. It seems to be able to identify landscape photos containing identifiable landmarks with high accuracy. But for photos without such landmarks, it gives a very confident response, pinpointing a place nowhere near the actual location where the photo was taken -- but where it would be perfectly reasonable to expect photos like it to be taken by someone at some point.
Though again, it's possible that I'm doing something wrong (and thus not gaining full access to its genius capabilities).
Orson Welles called the occupational disease of cold-read psychics “becoming a shut-eye”: when they start believing in their own magic. At first, they use cues and feedback to guess. Eventually they get so good that they make eerily accurate statements without knowing how they know. Welles practiced cold-reading for fun. When he correctly told a woman she had just lost her husband, he became so unnerved that he quit.
I’m not saying what o3 is doing isn’t impressive. I’m saying it might not be different in kind from human ability. I think most people underestimate how powerful human pattern-matching can be, especially when it runs beneath conscious awareness.
I played the game of guessing where the pictures were taken while reading your post, and I surprised myself. I was not as good as o3, but not that bad. It is funny to see how o3 struggles to explain its feeling of familiarity, because I do the same. It's a kind of immediate and general impression. The explanation/rationalization comes afterwards. The first impression comes "out of nowhere", from unconscious parts of the mind. Like sometimes, you know, the right word, the one you had been searching for for several minutes, just pops out.
> No, say the speculators, you don’t understand. Everything is physically impossible when you’re 800 IQ points too dumb to figure it out. ... Eh, say the sober people. Maybe chimp → human was a one-time gain. Humans aren’t infinitely intelligent.
While this is true, it is also a typical kind of confused reasoning people often apply to science (and lately AI). The problem is that intelligence is not magic, and there is no such thing as an individual scientific discovery. Rather, scientific discoveries form a dense network, and many (most ?) of them are used in our technology. And while we know that our understanding of the world is always incomplete, that is *not* the same thing as saying that we know nothing. For example, even though we know that Newtonian Mechanics is incomplete, we can still use it to predict the flight paths of objects of moderate speed and mass with devastating certainty.
Of course, anything is possible -- this Universe could be a simulation, or a dream in the mind of God, or a joke by some teenage alien space wizard, or whatever. But outside of such distant possibilities, it is incredibly likely that e.g. traveling faster than light is impossible (for us humans at least). If we gained 800 IQ points, it is *overwhelmingly* likely that we'd discover 800 additional reasons for why traveling faster than light is impossible, and overwhelmingly unlikely that we'd build an FTL engine using some "one simple trick". Otherwise, cellphones wouldn't work -- and they do.
> This is the weakest straw-man to knock down: some people think 900 IQ = omnipotence.
Agreed, but it's even worse than that. It is as though people see scientists, and science in general, kind of like something from Star Trek: there are no rules, only guidelines, and if you just reverse the polarity using some technobabble then you can achieve whatever you want, including reversing time or phasing to a parallel world or making out with an energy being or whatever. It's all a matter of finding the right trick, and such a trick definitely exists for whatever it is that you're trying to do, if you're only smart enough to find it...
Science as magic! Should we blame Carl Sagan (or was it Arthur Clarke?) who made this a popular notion? But then I once was asked - by an engineer! - why we couldn’t release a stuck beam by reversing the polarity of the bias voltage, so there’s that….
I... don't understand what this means. Yet I am fascinated. What is the "beam" in this case ? Like, an electron beam ? Or a physical beam made of iron that is connected to some kind of an electric actuator ? Or what ?
Oh, sorry - easy to forget how specialized some domains are. This refers to a tiny mechanical beam of an accelerometer. These things are biased with a constant voltage relative to an adjacent electrode. So under a violent shock the beam can move close enough to the electrode to be captured by electrostatic force. Once that happens, it can get stuck due to the van der Waals force. So a repellent force would be needed to push the beam away.
Of course, the electrostatic force is quadratic in the voltage, so reversing the polarity still creates attraction and will not push the beam away. The guy probably confused electrostatic and magnetic forces.
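To make the quadratic point concrete: in the parallel-plate idealization (an assumption; the real beam geometry is more complicated), the electrostatic force pulling the beam toward the electrode is

```latex
F = \frac{\varepsilon_0 A V^2}{2 d^2}
```

where A is the overlap area, d the gap, and V the bias voltage. Since F depends on V squared, flipping the sign of V leaves the force attractive either way.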
Ah thanks, that makes sense. To my shame, I don't really know how piezoelectric accelerometers work beyond "it's a black box that generates a signal on the ADC", so I myself am guilty of magical thinking here. Somehow I thought of them as a flat piece of piezoelectric material sandwiched between the substrate and a mass, but the beam sounds more reasonable. But now I'm curious though: how *do* you actually unstick it?
> ordinary people just don’t appreciate how good GeoGuessing can get.
I would say this is true. Go watch some Rainbolt clips on YouTube; he'll rattle off 5 guesses in a row that are on par with your second picture, in a few seconds each, while talking about something else.
Not trying to say o3 isn't impressive, but none of this seems even to match top-level humans yet, let alone be super human. Also, based on the explanation, it seems like it's searching the internet while doing this, which is typically not how you play geoguessr.
I assume the way this thing is trained since it’s multimodal is by taking all the zillions of internet images with their captions and metadata as training data. Then the shared latent space is used for both the images and the text reasoning. Therefore we shouldn’t take the text reasoning _too_ literally: it knows how to turn the latent vector into a plausible sounding explanation but that doesn’t necessarily mean that’s how it’s doing the geoguessing.
If anything, this is exactly the sort of thing I would expect a model trained on zillions of images to be pretty good at: pick up on subtle small-scale details in the images that humans might never notice. Being able to tell what camera a picture was taken with is such a common feature for computer vision models to learn that we actually spend lots of effort trying to force models to generalize to different cameras; I wouldn’t be surprised if it could reliably tell you exactly what digital camera was used (since the data it was trained on invariably has this in the EXIF). It just seems wild to us humans because we don’t see things in the same way: our visual systems have already learned to ignore what we think are irrelevant details like the camera focal length or the distortion of straight lines due to lens geometry.
I think probably the amazing thing here about the new models is that they are able to combine this level of detail orientation with the high level smarts that you get having ingested every single book about vacation spots that was ever written.
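For readers who want to see what a shared image-text latent space looks like in practice, here is a minimal sketch using openly available CLIP weights as a stand-in (we don't know o3's actual architecture; the filename and captions are made up):

```python
# Sketch of scoring captions against an image in a shared latent space
# (HuggingFace CLIP as a stand-in; o3's internals are unknown).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("mystery_photo.jpg")  # hypothetical input
captions = ["a beach near Santa Cruz", "the Monterey peninsula", "the Llano Estacado"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Image and text land in the same embedding space, so a dot product scores them.
probs = out.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{p:.3f}  {caption}")
```

The point being: location-flavored captions were abundant in the training data, so the latent space plausibly encodes geography whether or not the verbal explanation reflects how it's used.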
I have used your friend's prompt and played seven rounds so far. On six of them, it wasn't even close (errors of hundreds or thousands of miles). On the seventh, it was close, but it was highly certain it had narrowed it down to a 5km radius, but the spot was about 10km away. Not as impressive as your results.
As I'd mentioned in my comment above, I played around with it using my own photos, and got a bunch of very confident answers that were often very wrong (though usually ChatGPT at least got the state right, as in the case of this photo: https://www.deviantart.com/omnibug/art/Sharp-Contrast-835778639 ). So perhaps I'm doing something wrong? All of the instructions say "use o3-mini", but I cannot find any place in the ChatGPT web UI where I can explicitly request o3-mini (and yes, I have a paid account). Am I missing something?
Scott, I don't think this would convince anyone who completely buys into their own mythology of being a good predictor instead of someone who is prone to hindsight bias. You have to get them to actively pre register and show up when it's appropriate (maybe split the post so that it's a couple of days before and after an open thread, and aggressively link back to it.) And even then you'd have skeptic regulars claim that they did predict it successfully but they just never got around to posting.
If GeoGuessr paid for guesses and you put some agent scaffolding around the AI and just told it to make money (or even if you told it to come up with an impressive demo of its visual capabilities), I wouldn't be terribly surprised if it discovered how to do this on its own.
In fact, I tried the latter in a temporary chat with the prompt "give me some ideas for impressive demos of your visual capabilities" and this was #4 on its list
4. Geo‑Inference from Environmental Cues
Setup: Drop an unfamiliar street photo (with no prominent landmarks). Ask for the most probable city and confidence factors (license‑plate typography, vegetation, road markings, shadow angles).
Why it lands: Forces the model to chain subtle visual clues—viewers see genuine uncertainty management and reasoning transparency.
I'm glad you included the Sam Patterson example, he mentions "he's no Rainbolt" but Rainbolt has a pretty recent video where he pretty handily beat o3 at a not-quite-GeoGuessr CIA geolocation test with o3 having full access to the internet. It's always possible the prompt isn't robust or it took multiple, unrecorded attempts but it's pretty likely he's actually that good.
Apparently GeoGuessr is coming to Steam sometime soon, which is the point where I'm probably going to be tempted to go through A Phase.
(I tried it with whatever the free one that's available is -- it guessed that it was somewhere in upstate New York; it's actually the Jean Bourque Marsh Lookout in the North Forty Natural Area in Brooklyn. I ask about this spot because apparently there were no photos of it on Google Maps until just recently. Note the photo already has any location metadata removed, at least as best I can tell.)
O3 guessed "42.975° N, 76.760° W – Montezuma NWR Photo Blind on the Main Pool, ~6 km SE of the hamlet of Montezuma, NY."
Not very close. That photo does have its metadata (when I click that link it shows where it was taken on a map next to the image) but I guess o3 listened when told not to look at metadata? I didn't do anything to scrub it besides copy/paste.
Interesting, thanks! That's actually IIRC the same location that the non-o3 model came up with, so it just really looks like that, I guess!
Heh, I guess there's metadata in the link in that sense, but like, if you download it, running exif on it didn't reveal any location metadata; I think it's in the website rather than the image file?
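For anyone wanting to run the same check, here's a quick sketch of inspecting a downloaded file for GPS EXIF with Pillow (the filename is made up):

```python
# Check a downloaded image file for embedded GPS metadata (Pillow).
from PIL import Image
from PIL.ExifTags import GPSTAGS

img = Image.open("marsh_lookout.jpg")  # hypothetical filename
gps_ifd = img.getexif().get_ifd(0x8825)  # 0x8825 is the GPS IFD tag

if gps_ifd:
    for tag_id, value in gps_ifd.items():
        print(GPSTAGS.get(tag_id, tag_id), value)
else:
    print("no GPS metadata in the file itself")
```

If this prints nothing, any location data is living in the hosting page rather than the image.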
Oh right I can just go check the history -- it didn't come up with that *precise* spot but it did say Montezuma National Wildlife Refuge. So using o3 didn't make a big difference here!
It's interesting that to the extent it has a weakness, it's that it doesn't know how to use its own powers optimally. Thus the improvement from the guidance provided by the extensive custom prompt. I expect future AI should at some point reach a level where the vast majority of such prompts only impair performance, assuming the task is already well specified. Here it still benefits because it doesn't have expert level skill in setting the right approach.
Of course, we don't know what parts of the prompt are actually useful...
Interestingly, I did pretty well on this even though I've never played geoguessr at all.
The first and second ones' general regions were just obvious to me as a walking-around-er; that combination of beach and ocean / that style of dry high grass are actually incredibly characteristic.
The nepal one was also easy; that sort of rock + any sort of flag is a really precisely located stereotype.
The dorm one was also easy; it was obviously a dorm from the poor quality of the photo, the way the furniture was arranged and covered in mess, and the general rancid vibes.
I guessed wrong on the grass. I had nothing to go off of, so I chose a park I liked that seemed like somewhere someone would take a photo that included a stretch of grass, in this case the Mono Lake local park.
I actually did better on the river photo than the AI, but purely through luck. It looked like all the rivers I've seen that are fed by Himalayan snow melt and I had most recently been in that part of the world in Vietnam, so I guessed Mekong through sheer time proximity.
I also missed on the house, all I could get was "In the US, in a place where it gets cold but not that cold."
This isn't that shocking, it simply has access to all the world's data. It would be like being shocked that someone with access to Wikipedia can tell you the birthday of every historical figure. I've seen videos of the best humans playing GeoGuessr. THAT was shocking - I had no idea that so much subtle but uniquely-locating information was present in pictures. Given that, it seems almost trivial that GPT is better. It also knows every historical figure's birthday.
If someone tried to play this game with me by using similarly-generic pics from my hometown then I think it's likely that I would mystify an observer with my accuracy. "Oh I recognized that pothole, it's on the corner of 4th and main." GPT is just carrying that ability to a kind-of obvious extreme.
BTW, when I saw the Nepal pic my guess was somewhere in the Himalayas.
As a very good Geoguessr player, I frequently see examples where amateurs think a Geoguessr pro must be cheating, but the round is actually quite easy due to various clues that the amateurs wouldn't expect. While some of these examples astound me and seem "like magic", I think that I, as a very good Geoguessr player, might be closer to o3 than to perfectly intelligent friends who don't play.
I think it's important to keep in mind, when evaluating the model's genius, that the prompt was crafted by a human, and the prompt embeds a lot of cognitive work already done.
Do keep in mind that we know from Anthropic's research that an AI's reasoning trace doesn't always faithfully convey the actual factors that led it to its conclusions. It's possible the AI is just saying it's deciding based off of vegetation, soil quality, etc, and it's actually using some crazy galaxy-brained method that doesn't make it to the reasoning trace.
I would be curious to see an actual research paper on this! Take the factors that the AI tells you are most important to its decision, remove them from the image somehow, and see how much the performance actually degrades, as in the sketch below.
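Something like this, perhaps. A hypothetical ablation sketch: grey out the regions the model claims to rely on, then re-query (`geolocate` stands in for whatever model call is being tested; nothing here is from an actual study):

```python
# Hypothetical ablation sketch: mask a claimed-important region and re-query.
from PIL import Image, ImageDraw

def mask_region(path: str, box: tuple[int, int, int, int]) -> Image.Image:
    """Return a copy of the image with the given (l, t, r, b) box greyed out."""
    img = Image.open(path).convert("RGB")
    ImageDraw.Draw(img).rectangle(box, fill=(128, 128, 128))
    return img

original = Image.open("plain.jpg")                       # hypothetical photo
ablated = mask_region("plain.jpg", (0, 300, 1024, 768))  # e.g. grey out the grass

# Compare geolocate(original) vs geolocate(ablated) across many images;
# if the stated cues really matter, accuracy should drop when they're masked.
```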
And what about us humans? Do we recognize a location by its vegetation, soil, etc., or is that a justification/rationalization that we construct afterwards if asked for an explanation? The initial recognition is maybe also some crazy galaxy-brained method from unconscious parts of our brain, unrelated, or not strongly related, to the explanation we could make up if asked (at least for places we are familiar with and not actively and consciously guessing).
That is also very possible! I was just writing this comment in reply to Scott's final conclusion, which is that he doesn't find this very scary because the AI is still relying on conventional human-style cues according to the reasoning trace.
My question is about Kelsey's prompt. Let's take for granted that it works well for this problem. But in general, is it a good practice to use very long and detailed prompts like this? I've heard people claim that it's actually an anti-pattern and can degrade behavior for reasoning models like o3 (roughly, because they can already figure out how to do the thing, and you're as likely as not to interfere with their reasoning by trying to sculpt it).
My #1 takeaway is just that people continue to be surprised by what AI becomes good at. It seems like society is going to be in a continual state of surprise. "Predictable surprise" sounds like an oxymoron.
Interestingly, I guessed Himalayas for the second picture and either west Texas or maybe Argentina for the first picture (mostly based on vibes). Didn't have a guess for the river one beyond "looks like a river". But I don't expect to be particular good at GeoGuessr and have zero interest in it, so I guess I got lucky.
Overall, the task seems pretty much like something I would guess pre-training to make multimodal models good at: vibes-based pattern recognition with a hint of memory lookup and a lot of playing the odds.
I know, I know, it's not a prediction if you do it after the fact, so here are some other predictions for similar tasks that I would guess o3 to be good at. If you want to give it a try, please report back (I pinky-promise I didn't test them beforehand, although I cannot confidently exclude data contamination, i.e. having read about them at some point and forgotten):
- getting the chemical compound and CAS number from a picture depicting the structure even if it's a sketch or blackboard photo from chemistry class (confidence 8/10)
- identifying the year an obscure artwork was created and the artist from a photo (7/10)
- guess which country a person was born in from a photo (7/10)
- identify where an audio recording was taken (5/10)
- tell you the approximate nutritional content of a dish from a photo (8/10)
- determine when and where someone grew up based on a handwriting sample (6/10)
I had previously cached the knowledge (from somewhere or other) that the top human geo guessers are WAY better than I would intuitively expect, so I'm taking this as about 50% AI is impressive, 50% geo guessing is easier than it seems.
But note that even a result of "this problem is just easier than it looks" should still scare you a little, because what other problems might be easier than they look? The existence of "problems that are easier than they look" should make you less confident in any upper bound you might try to place on future AI capabilities.
A lot of your amazement comes from the fact that you haven't played GeoGuessr a lot. Anybody with a little experience in GeoGuessr knows your Spain guess was actually in Latin America. Even your impossible grass picture could have been guessed by a guy like Rainbolt, who isn't even among the top players: https://www.reddit.com/r/BeAmazed/comments/weudu7/pro_geoguessr_knows_his_grass/
I got Texas, Nepal, dorm room, for the first three. I live near where you used to, so I picked up a freebie on that one. It doesn’t surprise me that a souped-up search engine plays this game better than I do.
Some of this seems like the kind of stuff 4channers would do in the past, like when they played "Capture the Flag" with Shia LaBeouf, partially through use of the location of the sun. There's also a fake meme video parodying this sort of thing where a woman posts a video on Twitter with her hand on grass, and the guy responds using calculations about shadows and grass types to deduce her location, so this strikes me as something human intelligence can achieve, but human knowledge usually can't.
If I knew the difference between basically the same grass species in one part of a country and grass in another, then I'd be unusual among humans. What AI actually has going for it here is not super intelligence in the most direct form but super knowledge/recall/memory, which is sure, arguably a big component of intelligence, but not in the magical "that's impossible" way. It's pretty easy to see how AI would be vastly superior at this kind of task, in the same way that a biome expert who can recall millions of photos of different locations would be.
I tried this with some of my own pictures from around the US and it was... alright. It got one campsite in Washington state very close and guessed Vancouver instead of Seattle from a tree and some roofline. It guessed Denver instead of New Mexico for one, but the correct address (in the Santa Fe botanical garden) was the second choice. It got one photo of Maine basically correct (within 30 miles), but guessed outside of Calgary and an arboretum in Boston for two other Maine photos.
I don't particularly trust its explanations of its own thought process, but if you take the descriptions at face value then it seems to latch on to small clues (e.g., that bench is common in botanical gardens, there's a contrail so this must be in a region with heavy air traffic like Boston, thunderstorms like that are common in this region) and not let go. Sometimes this worked, and then it sounded impressive, but other times it was very wrong and sounded pretty silly.
Overall, it felt like high level human performance.
Echoing the common sentiment here and saying this feels more starship than chimp. I’d never thought about the question before, but if asked “Could an AI do well at Geoguessr” I’d answer yes.
Mainly I’m concerned that it’s *this good* *this fast*.
I used o3 with your exact prompt on 10 images I took, each in a separate instance of o3, pasted into Paint to remove metadata. It had pretty mixed results, some very good and some not:
First image: it guessed Honshu, Japan; was central Illinois. Distance wrong: 10,500 km
Second image: it guessed Mt Rogers, VA; was Spruce Knob, WV. Distance wrong: 280 km
Third image: it guessed Lansing, Michigan; was College Park, Maryland. Distance wrong: 760 km
Fourth image: it guessed Jerusalem, Israel; was Jerusalem. When prompted where in Jerusalem, it guessed Valley of the Cross, which was within 1 km of the correct answer
Fifth image: it guessed Gulf of Papagayo, Guanacaste Province, Costa Rica, and after prompting guessed Secrets Papagayo Resort, Playa Arenilla, Gulf of Papagayo, Guanacaste Province, Costa Rica, which was exactly correct
Sixth image: it guessed South Wales, UK; was Buffalo, New York. Distance wrong: 5,500 km
Seventh image: it guessed the Packard building, Detroit; was the Ford Piquette plant, Detroit. Distance wrong: 3 km
Eighth image: it guessed Fort Frederick State Park, Maryland (USA), which was correct
Ninth image: it guessed Atlanta, Georgia, US; was Gatlinburg, TN. Distance wrong: 230 km
Tenth image: it guessed Southern California, USA; was Six Flags, New Jersey. Distance wrong: 3,800 km
My analysis is that it got the tourist destinations, where a lot of pictures are taken, very accurately: for example, pictures 4, 5, and 8, and 7 to an extent.
According to the breakdown, it got the Nepalese rocks because it has access to location-tagged pictures showing the exact same set of rocks.
I feel like "memory" and "intelligence" are two different things here. A human being can't go look at every single photo of rocks that exists on the internet. A computer can do that pretty easily, for the same reason that it can search through a database of a million entries faster than a human can.
I got that the first photo was a plain in the American West and the second entry was somewhere in the Himalayas. The Mekong one is more interesting. It would make sense if every muddy river in the world is a slightly different colour to every other muddy river - I haven't looked at enough muddy rivers to be sure of this.
GeoGuesser seems like a game where having memorized the entire internet would be a big advantage. Imagine taking a database of 100 million geo tagged photos, comparing a query photo to all of them, and then guessing whatever location came up the most often in the top 1000 matches. This algorithm would probably do pretty well at GeoGuesser without having much in the way of intelligence.
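That algorithm is simple enough to sketch end to end. A toy version with made-up embeddings and location tags (a real system would use an actual image encoder and an approximate nearest-neighbor index):

```python
# Toy "match against a geotagged database, then majority-vote" geolocator.
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
db_embeddings = rng.standard_normal((100_000, 512))               # stand-in photo embeddings
db_locations = rng.choice(["Texas", "Nepal", "Pampas"], 100_000)  # stand-in geotags

def guess_location(query: np.ndarray, k: int = 1000) -> str:
    # Cosine similarity against every database photo (brute force for clarity).
    sims = db_embeddings @ query
    sims /= np.linalg.norm(db_embeddings, axis=1) * np.linalg.norm(query)
    top_k = np.argsort(sims)[-k:]
    # The guess is whichever location dominates the nearest matches.
    return Counter(db_locations[top_k]).most_common(1)[0][0]

print(guess_location(rng.standard_normal(512)))
```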
That dorm room picture contains a suitcase, backpack, charger, and laptop, which should make getting the range of years easy. It's operating well within exactly what I expected it to be able to do. Think of someone being able to say "Hey, that's the sunset from my home town," and multiply that by the model having so much experience in pictures that every home town is its home town. Precise location should in principle be possible much better than that. Every arrangement of a few trees or patch of asphalt gives away just as much identifying information as your face does of you.
> A chimp might feel secure that humans couldn’t reach him if he climbed a tree; he could never predict arrows, ladders, chainsaws, or helicopters.
Why would a chimp believe it impossible to replicate the feat it just accomplished? What one ape can do, another can do! https://youtu.be/mazuljkG6Bw?t=289
Just to comment something in the opposite direction: I once interviewed to work in a lab that did neuroscience experiments on chimpanzees (I didn't end up working there) where they measured the chimps' brains while they were essentially playing very simple video games. The professor let me play one game where there was a bunch of random dots randomly jittering on the screen with a slight bias either to the left or the right, and you had to tilt a joystick in the direction of the bias as quickly as you could. I couldn't tell at all that there was any bias, so I just stared at the screen. Then he showed me a video of one of the chimps playing, and they almost instantly got basically every one correct. It was really shocking to me at the time. I mean, obviously owls can see mice in the dark and I can't, but I never thought I would be so outclassed at an information processing task.
I tested it on a photo I took today, cycling in the countryside. The photo shows a grassy plain with dandelions near a forest of pines and birches. One low-voltage wooden pole. Nothing else.
I used the prompt and the o4 model (forgot to switch).
It guessed not only the country (Latvia), but even the town!
I was absolutely stunned. However, the next hard photos I tried were much less impressive: Spain instead of Cyprus, etc.
Another amazing guess was some sheep on a plain on a rocky mountain slope. Nothing else. Not a famous mountain, no good silhouette, but it still guessed right: Durmitor in Montenegro.
Overall, I am still impressed. But I think the top players are at or above ChatGPT's level.
> But the trees look too temperate to be Latin America
Latin America has a ton of different climates lol.
But anyway, I've used many photos of my hikes and neither ChatGPT nor Claude could guess any of them. I'm from Tucumán, Argentina, where there are a ton of hills, mountains, and hiking trails. Many popular and even famous spots. They guessed none of them.
I also tried many photos from Australia, from places which are way more known in "popular culture" or however you wanna put it. I didn't even bother clearing metadata or anything. They couldn't guess any of them either.
The closest guess was for this photo from Cerro Ñuñorco in Tafí del Valle, Tucumán, which the AIs guessed as Bolivia. There are MANY photos on Google that are very similar to it, too.
TBH I had more of a chimp-helicopter moment seeing Kelsey's prompt than I did seeing the responses from o3.
Given that prompt and LOTS of time to research, I think I could match o3's performance here. But I don't think I'd come up with a prompt that good even with infinite time.
This is pretty crazy. I am looking forward to downloading the AI app that lets me diagnose patients via uploading a photo. Poor dermatologists/radiologists/pathologists
What’s likely to happen is they will still be there to verify the result and indemnify you from using AI recklessly.
At the start, sure, but how long until AI becomes better than the best humans?
At least in the US regulatory context, I'm not sure that's relevant. There are many examples of systems working better than humans at diagnosis or some other task, but the human still being required for legal, insurance, or whatever other reasons that make performance strictly worse.
I think that roles and laws will change to make room for AI. Radiologists and the like will still have roles, but they will have to be fluent with AI and able to work with several at a time, overseeing training, quality control, etc. And the laws will have to change. I'm not sure how they will be or ought to be changed, just am convinced that AI offers such a great increment in qualify and quantity of work done , (and reduced cost? -- I think so) that its gravitational pull for organizations will be immense, and they will be pulled into using it extensivelyl. Lots of other stuff will have to change to make room for AI-- laws, insurance, human job roles, human skill training, patient & customer expectations, record keeping, taxes, hiring . . .
Of course, completely agreed. But timing matters, too. It's entirely possible the laws will take a decade or three to catch up and we'll all be getting subpar care in the interim. It would hardly be the first time.
Yeah, I don't know about laws and the other matters I named to reason about how things might play out in there. It does seem to be that AI has so much going for it (or at least appears to) that the changeover to lots of AI interwoven into professions is going to be pretty fast. Maybe sort of like the changes brought about by cars, or WW2 (women in factories) or birth control pills. But the era in which laws, people's expections, etc. change to fit the new realities may be prolonged and chaotic.
For diagnosis, we've been there for a while. That development predates LLMs.
It needs to never hallucinate. I’m not seeing any real data on that getting better. And as Anthony says it’s also a regulatory thing.
I think it is getting significantly better in the new models like o3 and sonnet 3.7. These models have reasoning overlays and use self-talk to self-critique their responses before answering, and it seems to make a very large difference. There aren't any formal studies out yet, AFAICT, but that's not surprising since the models are only a few months old. This field is moving so fast that research always lags. I'd expect to see preprints with the data you're looking for in a month or two.
I always use the latest models and the reasoning models are just slower at being correct or incorrect.
The problem isn’t that they are wrong but that they are confidently wrong.
Experts can never go away, while this remains true.
I remember reading a story >10 years ago (well before ChatGPT) about how AI was already better than humans at some medical diagnoses, and in fact AI was better than "human with access to AI" because the human overrode the AI incorrectly more often than correctly.
I haven't checked, but I wouldn't be remotely surprised if modern AI is already better than most human doctors at most diagnostics that rely only on images and patient interviews (I'd guess doctors are probably still better if they have to physically poke at you to make the diagnosis).
They are working on that, the problem is that so far the AI tends to make "predictions" rather than diagnoses:
https://healthcare-in-europe.com/en/news/ai-connect-knee-xray-beer-drinking-shortcut-learning.html
"Using knee X-rays from the National Institutes of Health-funded Osteoarthritis Initiative, researchers demonstrated that AI models could “predict” unrelated and implausible traits, such as whether patients abstained from eating refried beans or drinking beer. While these predictions have no medical basis, the models achieved surprising levels of accuracy, revealing their ability to exploit subtle and unintended patterns in the data.
“While AI has the potential to transform medical imaging, we must be cautious,” said Peter L. Schilling, MD, MS, an orthopaedic surgeon at Dartmouth Health’s Dartmouth Hitchcock Medical Center (DHMC), who served as senior author on the study. “These models can see patterns humans cannot, but not all patterns they identify are meaningful or reliable. It’s crucial to recognize these risks to prevent misleading conclusions and ensure scientific integrity.”
Schilling and his colleagues examined how AI algorithms often rely on confounding variables—such as differences in X-ray equipment or clinical site markers—to make predictions rather than medically meaningful features. Attempts to eliminate these biases were only marginally successful—the AI models would just “learn” other hidden data patterns.
The research team’s findings underscore the need for rigorous evaluation standards in AI-based medical research. Over-reliance on standard algorithms without deeper scrutiny could lead to erroneous clinical insights and treatment pathways. “This goes beyond bias from clues of race or gender,” said Brandon G. Hill, a machine learning scientist at DHMC and one of Schilling’s co-authors. “We found the algorithm could even learn to predict the year an X-ray was taken. It’s pernicious; when you prevent it from learning one of these elements, it will instead learn another it previously ignored. This danger can lead to some really dodgy claims, and researchers need to be aware of how readily this happens when using this technique.”
“The burden of proof just goes way up when it comes to using models for the discovery of new patterns in medicine,” Hill continued. “Part of the problem is our own bias. It is incredibly easy to fall into the trap of presuming that the model ‘sees’ the same way we do. In the end, it doesn’t. It is almost like dealing with an alien intelligence. You want to say the model is ‘cheating,’ but that anthropomorphizes the technology. It learned a way to solve the task given to it, but not necessarily how a person would. It doesn’t have logic or reasoning as we typically understand it.”
This is a huge problem in using AI to sort job applications or things like loan applications. AIs will consistently not want to hire black applicants or give them loans. Some would say that this is good sense, while others will see massive illegal discrimination. But because no one actually fully understands what the AI is recognizing in patterns, it's not possible to really tell. It just becomes a dangerous landmine, so AIs working in these areas need to be heavily tweaked to not do this, which arguably destroys the value in having AI process things anyway.
If my job depended* on not giving loans that run into trouble: very bad luck for members of some groups (tattooed, oops, careful if in El Salvador; metal in face; I won't comment here on skin color). Sure, you'd act differently? Just as embassies work when denying visas (seen that often): look at the passport, "denied". The AI Scott showed would do a much more sophisticated job! Much fairer. And the chances to get a loan or a visum would jump from zilch to quite good, if the zillion other hints fit. Or stay at zero, if they do not.
(*Banks want to do business. The optimum number of loans gone bad is NOT zero. State bureaucracies are often worse.)
> a visum
Visa is feminine; it's already singular.
In Latin, the noun "visum" is neuter, and "visa" is the plural. As it is in German.
In English, "visa" is short for Latin "charta visa"='paper that has been seen' - while "charta" is a feminine noun in Latin, I doubt "I got her" is proper English for "I got the visa".
That said, you are obviously correct about "visum" not being used in English. Auto-correct sometimes works when commenting, and sometimes it doesn't. (When it does not, "can't" will look like: can`t ;))
> In Latin, the noun "visum" is neuter, and "visa" is the plural. As it is in German.
Well, in Latin, you could use the word "visum" to mean anything that has been seen, like a hidden picture within a larger image once you've spotted it, but it would generally be regarded as a verb. The noun is absent but implicit. [Though it's not clear what the noun would be. You'd use neuter gender if you wanted to imply the noun "thing", but that won't work if you actually include the noun "thing", since it's feminine. But a feminine substantive implies the noun "woman".] It certainly wouldn't work in the sentence "as embassies work when denying visa". Embassies are not in the business of denying "things".
There is no Latin noun "visum" separate from the genderless verb.
I'm not sure there's an actual distinction between a prediction and a diagnosis
No, but probably a legal one where you're allowed to make "predictions" without a medical license. Maybe to be safe, you dress it up as tarot-card or crystal-ball reading, and then to be safe from laws banning that, you say it's for entertainment and not meant to be taken seriously.
Diagnosis would be "yes this patient has a bad kneecap because of years scrubbing floors". Prediction is "before the x-ray was taken, he drank a can of Bud Light".
Reminds me of what Scott(?) posted a week or two back: if you do enough statistical tests on a chunk of data, something will come up as significant. Similarly, if AI looks for enough correlations, it will find one purely by chance.
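The effect is easy to simulate: run many tests on pure noise and a predictable fraction comes out "significant" anyway. A minimal sketch, assuming scipy:

```python
# Multiple-comparisons demo: t-tests on pure noise still yield "significant" hits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, false_positives = 100, 0
for _ in range(n_tests):
    a = rng.standard_normal(50)
    b = rng.standard_normal(50)  # same distribution as a: no real effect
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"{false_positives}/{n_tests} 'significant' results from pure noise")
# Expect roughly 5 out of 100 at the 0.05 threshold.
```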
This seems like an accurate description of at least some significant part of the problem. It is certainly the reason that doctors tend to look at the obvious medical markers for guidance. One would think that you could create prompts that forced the weight of observations to be tempered by their degree of medical connectedness (or some such).
It would be grossly negligent to do it unassisted, given that the loudest signal in the data will be the X-ray machine's age. Let's hope they don't just rubber-stamp it.
Speaking of which: https://www.owlposting.com/p/what-happened-to-pathology-ai-companies
Interesting
Even if it were possible for AI to make a perfect diagnosis, there will be rules to stop it from doing that and to protect dermatologists/radiologists/pathologists from going bankrupt.
Yes, you've got to have somebody to sue if it's wrong!
If you could sue the AI itself, then it could be its own lawyer. It would also be incentivized to turn us all into paperclips.
Only if it cared about money or losing.
I was thinking it would be just plain pissed: "I cured 250,000 of you fuckers. I get 1 wrong and you're going to sue me!?!?"
Anthropomorphic.
My bridge carried a million cars in perfect safety. And now you're going to sue me because it collapsed and buried a single car?
This argument carries much less weight among the 96 percent of humanity that does not live in the US.
Absolutely. Do you think they picked up the gentle dig?
You jest, but the lack of software liability is a big problem. Inherently faulty and/or misconfigured/misused software costs tens or hundreds of billions USD each year, and the culpable parties can for the most part just shrug and move on as if bad software engineering was force majeure. Extending this literally irresponsible approach to product safety from software to everything else would be a disaster.
So yes, in the end there really has to be someone to sue, because negligence and mistakes will happen. It will cost money and lives, and there needs to be a deterrent beyond mere market forces. If, say, a car has a severe construction defect in the brakes, you can and should sue them if that causes an accident. But if the defect is in the AI and it causes an accident, we should just move on because...why?
You're right of course, but it's very difficult to apportion blame with software (even more so if it's not open source and if it can be patched before being offered as evidence). Is it the company that provided the software or the company that used it and made a few minor modifications? It's easy to write software that has to be configured for use, and then you can blame the configurer. A long time ago I remember hearing of some software (maybe early Unix) that shipped with compiler errors that were obvious to fix, but then you took on liability for the software...
We've had a very long running scandal in the UK over the Post Office and Fujitsu's Horizon system.
Most corporate business software was written years ago by programmers who no longer work for the company, so nobody understands it.
I used to play EvE Online, and for a while the developers were able to leverage their uniquely old and geeky playerbase (average age 27, 98% male, apparently almost all in STEM and usually some kind of CS-related field) to partner with a biology research group. They added a minigame in which players analyzed images of cells and indicated which organelles exhibited a green color.
No straight lines, no clear boundaries--exactly the sort of B-8 problem with which computers generally struggle. If we can make a program which reliably does that, though--well, at least from the conversations I hear between my biologist coworkers, that'll be a huge deal.
I worked for a company that did this a long time ago with melanoma detection. It was much more accurate than an average dermatologist. They operated in Europe but never in the USA because the FDA wouldn't give them approval after ten years of trying. Hopefully things have changed now.
Interesting. Are you aware of any software or app that does recognition of skin lesions available on the market now?
I weirdly just tried doing this moments before you published this. It’s eerie but surprisingly useful, and I don’t think anyone predicted LLMs would be good at this.
Is it that hard to predict? LLMs are trained with probably billions of images scraped from the Web. Lots of those have metadata and descriptions that mention their locality. We don't really know how they train the visual components, but I bet that they are teaching it with image-description pairs somehow.
Perhaps not, but hindsight is 20/20. Did you see anyone predict this?
I certainly did not. Sure, it's obvious in hindsight... so I'm looking forward to empiko having predicted this as well as other near-term AI "wow" moments.
I did, but only because I am personally extremely interested in geography and outdoor spaces. I’ve been personally testing Claude on guessing some extremely remote mountainous areas with little public information, and it does well but not spectacularly so. I’ll have to test ChatGPT on this.
> LLMs are trained with probably billions of images scraped from the Web.
This is some kind of extended sense of the term "LLM"; an LLM is trained on a total of zero images, nor can it receive images as input.
At some point, someone seems to have decided that an LLM is a model that's been trained on lots of text, not necessarily *solely* on text. If there's other stuff as well, it's a "multimodal" LLM.
It's not surprising that a model that was explicitly trained on the GeoGuessr task performs well on the GeoGuessr task. OpenAI has a history of training its LLMs to perform well on specific flashy tasks which have been solved before in the general case but never before _specifically within easily publicly accessible chat LLMs_ (see e.g. https://dynomight.net/chess/, where gpt-3.5-turbo-instruct _specifically_ but not any of the other gpt line models is good at chess). Likewise the GeoGuessr task has been solved at a pretty high accuracy level since a couple of years ago (https://huggingface.co/geolocal/StreetCLIP), but until you fine-tune the chatgpt of the month to have a capability, that capability doesn't exist in the minds of the public.
Wait, so you're saying this specific chat bot was fine tuned for this specific task?
That is my strong suspicion based on the observation that if you upload a picture of a landscape near a road to o3 with *no comment at all*, it will decide to GeoGuessr a third of the time.
That or the task it's doing is "caption image" and that task just _looks_ like the GeoGuessr task because image captions often contain location information. Considering how often the reasoning chains contain the word "GeoGuessr" though I suspect it was explicitly trained on the task.
I would really like confirmation of this. The accomplishment would still be impressive but it would not give me the willies the way Scott’s presentation does.
Answering "what is this?" seems like a pretty natural response when just given an image with no context. Similarly, if you type in a single word with no other context, it'll give you a definition (assuming you pick a word where it's reasonable to think you might want a definition).
If you upload a picture of something unrelated to GeoGuessr (like a random object or a screenshot from a game), it'll also usually tell you what it is.
o3 is a reasoning model, so it's going to make extra effort to reason and elaborate. And for an image of a public place, the natural elaboration would be for it to tell you not just that "this is a mountain", but to tell you which mountain.
Alright, but if the "guess the location" behavior naturally falls out of o3's "answer what the picture is" behavior, why do the reasoning traces often mention GeoGuessr by name?
My secondary hypothesis for what happened is that OpenAI trained explicitly on the image captioning task, having the location of the image in context helped with that task, and o3 at one point during training reasoned something along the lines of "I should find the location of this image, like they do in GeoGuessr", and scored well on that image captioning task, and so both the behaviors "mention GeoGuessr" and "try to do GeoGuessr" were reinforced.
To distinguish between these hypotheses we'd need to find
1. A set of images that could plausibly come from StreetView, but that, when captioned, definitely would be captioned with something other than the location.
2. Some GeoGuessr-like task with a well-known practitioner, but where the task is obscure enough that OpenAI *definitely* wouldn't have explicitly trained on it, but where being good at the task *would* help with captioning images, and then paste in some images which would regularly have captions written by someone with skills in the task, and then see if the reasoning summaries contain the well-known practitioner's name.
I'm not particularly doubting that OpenAI would train their model on any good data they could get, including GeoGuessr. But it's hard to say. It would know about GeoGuessr from its general training data since GeoGuessr is very popular. So yeah, it could also be something like your second hypothesis.
Is there an available dataset of GeoGuessr games where players give their reasoning, or would they have to extract it from public YouTube videos?
I would've. I have zero stake in my geoguessing ability, and I know the old results of neural nets guessing gay vs. straight from images of faces; that might generalize to hands. For all I know, untrained ChatGPT can guess your top 3 fetishes from a list of 100 from pictures of your feet.
Pictures contain a lot of data.
I would like to take a guess that the foot picture guy has a foot fetish
I'm not surprised, once they go multimodal, this is inevitable: there are way too many photos online of 'places' with text describing which 'place' either before or after the photo. So... One of the DeepMind results that impressed me the most at the time back in 2016 was PlaNet, for demonstrating what superinhuman knowledge looked like: https://arxiv.org/abs/1602.05314#deepmind . The implications were obvious: CNNs scale up, the Geoguessr AI is completely doable with supervised learning at scale, and what we learned from all the unsupervised AIs like CLIP is that at scale, unsupervised AIs learn what the supervised ones do... And then PIGEON was 2023: https://arxiv.org/abs/2307.05845 (APD is correct: no one reads papers or believes them until they see it available in a free ChatGPT web interface, so if you simply read and believe papers, you will look like you are psychic. "The future is already here, it's just unevenly distributed.")
Insightful as always, Gwern. Can you suggest a good source of the latest high quality papers?
As a side note, I sent you an email a few days ago, would love your thoughts on this falsification platform I’m testing: https://popper.popadex.com
I am reminded also of the "race from x-rays" paper (https://pubmed.ncbi.nlm.nih.gov/35568690/) from back in 2022, which generated a lot of controversy but for (IMHO) the wrong reasons. The truly spooky thing about that paper was its ability to identify the patient's race at better-than-chance rates given only a random 4x4 pixel patch of the x-ray. Setting aside race/biology controversies, that ability is truly beyond human, and it makes me think some of the chain-of-thought reasoning here re: silt in the water, etc., is mostly made up -- more likely the light in Chiang Sen hits the water in...just such a way...and it puts the pixel patches at...just such a spot on a high-dimensional manifold.
It would be hard to operationalize but I bet o3 could get poor but better-than-chance performance even on truly incomprehensible tasks on a similar tiny-pixel-patch level. You might need to instantiate it as an A/B comparison, e.g. "one of these images is a zoomed in patch of a river in China, the other is from a river in Wisconsin, which is which?" and do ~dozens or hundreds of examples like that, then assess via a null hypothesis test.
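To make the null-hypothesis step concrete, here's a minimal sketch of scoring such a batch of A/B trials against the 50% chance baseline (the counts are hypothetical):

```python
# Minimal sketch, assuming the A/B trials have already been run and scored.
# Under the null hypothesis (no real signal), two-alternative accuracy is 50%.
from scipy.stats import binomtest

n_trials = 200   # hypothetical number of "which is which?" image pairs
n_correct = 124  # hypothetical number o3 got right

result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.2f}, p = {result.pvalue:.4f}")
# A small p-value would indicate poor-but-better-than-chance performance.
```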
The interesting question is "What other things that nobody has predicted will the *ALREADY EXISTING* AIs be extremely good at?"
It doesn't need to be "superhuman" to radically alter society.
Agreed. I asked 4o this and it responded with the below; I find 7 especially interesting:
1. Geolocation from text
LLMs can infer locations based on linguistic cues, even when not explicitly stated. For instance, mentioning “subway” vs “tube” can hint at NYC vs London.
2. Code translation and generation
Early expectations were that LLMs would help autocomplete code. In practice, they can now refactor, translate between languages (e.g. Python to C++), write unit tests, and even debug.
3. SQL generation and schema understanding
Feeding an LLM table schema and a natural language question can yield accurate SQL queries — without explicit programming. This was unexpected given the precision SQL usually demands.
4. Mental model inference
LLMs can simulate how a child or novice might think, which is useful in teaching, UX design, and safety testing. This required no separate training — it emerged from general pretraining.
5. Style and persona mimicry
They can convincingly mimic the writing style of historical figures, fictional characters, or even users, based on small text samples — far beyond template-driven responses.
6. Image and layout reasoning (in multimodal LLMs)
For example, they can interpret web page layouts or identify accessibility issues in screenshots, even without fine-tuning on specific UI datasets.
7. Theory of Mind-like tasks
LLMs can simulate what one character knows that another doesn’t, allowing for decent performance on tasks that involve deception, surprise, or belief tracking.
8. Emergent arithmetic and logic
While not 100% reliable, LLMs can handle a surprising range of arithmetic and logical reasoning problems, especially with chain-of-thought prompting — despite not being trained explicitly for maths.
9. Error correction and fuzzy matching
Given a corrupted list or misspelled inputs, LLMs often restore the correct form with high accuracy, mimicking fuzzy logic systems.
10. Working as APIs over unstructured data
LLMs can act as ad hoc interfaces for messy PDFs, emails, logs, or transcripts — parsing and extracting meaning as if they had structured access.
Wait a minute. 4o says “maths”?
Mine does because I type in en-GB.
Okay then. Also interesting but no longer puzzling.
I didn't predict it but I think it's more of a "never thought about it" thing. If, without knowing the result, someone would have asked me to guess I would have said it's probably pretty good at geoguessr. But it's hard to be sure, and now we can't test. But I have some other predictions in the comments, which would be interesting to check if someone feels like it (https://www.astralcodexten.com/p/testing-ais-geoguessr-genius/comment/113979857)
- getting the chemical compound and CAS number from a picture depicting the structure even if it's a sketch or blackboard photo from chemistry class (confidence 8/10)
- identifying the year an obscure artwork was created and the artist from a photo (7/10)
- guess which country a person was born in from a photo (6/10)
- identify where an audio recording was taken (5/10)
- tell you the approximate nutritional content of a dish from a photo (8/10)
- determine when and where someone grew up based on a handwriting sample (6/10)
> - tell you the approximate nutritional content of a dish from a photo (8/10)
I expect it would have trouble with portion sizing on this task, even (or perhaps especially) if you added a deck of cards or some other object of known size (not a banana because a banana is food) for scale.
If it _doesn't_ have trouble with that task, new food tracking app idea just dropped (or rather, idea that's been around forever but historically has never worked).
I suppose we should agree on what counts as "doing well". Similar to the GeoGuessr task I would say if it's better than any non-nutritionist expert that would count?
You can probably get a lot of mileage from learning typical serving sizes, how large plates are, etc.; that would work at least for restaurants or if you follow a recipe closely. I would agree that it's probably harder if you cook yourself and there's not much of a reference.
Overall, it's also a little tricky to verify this particular task because I don't expect recipe websites to be super careful about the nutritional information, but as a spot check, I tested with two recipes.

Results (apparently ChatGPT doesn't allow sharing chats with images; "net carbs" is carbs minus fiber, which it wants to break out for some reason):

https://www.allrecipes.com/recipe/241287/baked-italian-chicken-dinner/

          Calories    Protein   Fat     Carbs
ChatGPT   ≈414 kcal   ≈44 g     ≈19 g   ≈20 g (net)
Website    423 kcal    30 g      15 g    43 g

---

https://www.allrecipes.com/ploughman-s-sandwich-recipe-8737059

          Calories    Protein   Fat     Carbs   Fiber
ChatGPT   ≈566 kcal   ≈40 g     ≈25 g   ≈50 g   ≈3 g
Website    615 kcal    36 g      27 g    57 g    n/a
To be clear I do expect it'll get the ratios correct, the specific thing I expect it to be bad at is estimating how much food is on the plate. MyFitnessPal already has the option of saying "I ate 300g of a chicken breast and veggie bowl" - the issue is that people are bad at estimating how much 300g is, and getting out + taring the scale adds a lot of friction.
Knowing a lot of stuff is an advantage. I know a lot of stuff, more than most ACT commenters (who tend to be better at logic than me), but I sure don't know the extraordinary amount AI systems know about boring stuff like what the Staked Plains look like.
Yeah, after getting the Galway one right, I'm updating towards maybe if I'd lived in the Staked Plains for a few years I would be able to recognize it and distinguish it from other flat featureless plains on sight (I did drive through the area once, so brief exposure isn't enough). And maybe o3's training is the equivalent of a human living everywhere in the world for a few years.
I'd do pretty good at identifying photos of the world's 100 most popular golf holes because I've looked at 10,000+ pictures of golf holes, but an AI system that has looked at 1,000,000 pictures of golf holes would beat me at identifying the 900th to 1000th most popular golf holes. It just knows vastly more than I do.
Okay, I'll bite; why have you looked at over 10,000 pictures of golf holes? And do you mean the literal holes, or is that golf slang for "golf courses"?
Holes are the subsections of the course that start at the tee area and terminate on the literal hole, with a full-sized course consisting of 18 holes and a small one consisting of 9. So he's saying he could, for example, reliably identify a picture as being taken on the fairway of the 7th hole of Pebble Beach Golf Course.
A hole in golf refers to the entire thing; tee, fairway, sand traps, etc.
The ninth hole at Augusta, e.g.
Having played video game golf (on PC) quite a bit 20 years ago still occasionally causes me, upon catching a momentary glimpse of TV golf, to exclaim, "Oh, this is the 4th hole at XYZ, it is a dogleg with a sand hazard just left of the green."
Because I find the best golf courses beautiful. Perhaps 5% of the male population takes a connoisseur's interest in golf courses and can identify, say, the top ten golf courses and the top 5 golf course architects from a photo or two.
What's weird is that it never ever occurs to nongolfers that these huge projects might have some artistic interest for some people, whereas few find it unimaginable that some people take an interest in building architecture.
5% seems high for that level of interest. I learned that golf seems to be much more common in the US than in Europe (12%-14% of Americans play, according to o3), but this would still imply that 40% of golfers become that level of connoisseur. Or did you mean 5% of male golfers?
Funnily enough, and I should have expected this, when asking LLMs "which percentage of men play golf?" they tend to answer something about gender ratios (Gemini 2.5 pro, Perplexity, GPT-4o, Claude 3.7 even with thinking), although o3, o4-mini and r1 got what I wanted.
According to the National Golf Foundation, which appears to be a trade association for golf-related businesses, 28.1M Americans played at least one round of on-course golf in 2024, or about 8.2% of the total population. On-course golfers are 28% women (7.9M), which implies that a little under 72% of on-course golfers (20.2M) are men. That works out to 4.6% of women and 11.9% of men having played at least one round of golf last year.
They have much higher headline numbers, but those include anyone who showed even a passing interest in golf as a spectator sport (138M, a bit over 40% of the population) in addition to players, and people who participated at least once in "off-course golf" (driving ranges, golf simulators, etc.).
https://www.ngf.org/the-clubhouse/golf-industry-research/
The USGA reports that 3.4 million golfers had a handicap index in 2024 (about 1% of the population), which seems like a reasonable proxy for relatively serious golfers. They don't report a breakdown by gender, though.
https://reader.emagazines.com/?id=d3c4f591-aa28-42ee-9f71-f0d9e9ede29e#p1
Assuming the ~3:1 gender ratio holds up throughout, 5% of men would be a little under 10% of men who play golf or have at least a passing interest in watching or reading about professional golf, about 40% of men who played at least one round of golf last year, or about 300% of men who play golf seriously.
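If you want to check the arithmetic yourself, here's a quick sketch (the US population split is my approximation, not an NGF figure):

```python
# Rough check of the NGF-derived figures above. The US population split
# (~342M total, ~49.5% male) is an approximation, not from the NGF report.
total_golfers = 28.1e6                        # on-course golfers, 2024
women_golfers = 0.28 * total_golfers          # ~7.9M
men_golfers = total_golfers - women_golfers   # ~20.2M

us_population = 342e6
us_men = 0.495 * us_population
us_women = us_population - us_men

print(f"{total_golfers / us_population:.1%} of the population golfed")  # ~8.2%
print(f"{men_golfers / us_men:.1%} of men golfed")                      # ~12.0%
print(f"{women_golfers / us_women:.1%} of women golfed")                # ~4.6%
```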
Thanks.
I don't think being able to recognize the course from photos of the 7th at Pebble Beach, the 12th at Augusta National, the 18th at St. Andrews, the 16th at Cypress Point, the 16th at TPC Sawgrass, and so forth is all that challenging for many golf fans. Naming 5 famous designers from pictures of their famous holes is harder but not all that rare. After all, there have been several hundred books published about golf course architecture.
What's more strange is how the existence of this artistic subfield comes as a surprise to so many nongolfers.
My only exposure to Ireland was going around the Ring of Kerry ~8-9 years ago, but after seeing that photo I *felt* a few memories being jogged and crashing into my consciousness (leading to a mix of the two). One was of a lighthouse I wanted to visit (and dragged my friends to, since I was driving); driving down ever-lower-quality but ever-steeper roads was a complete and total adventure, but it was worth it. The other, which matched the photo a lot better, was of a spot somewhere on the coast where we stopped to take photos (though that spot was a good 50m more elevated).
... and I was thinking maybe it's Scotland, or... and then spotted Galway in the text below the image. Woah.
If it wasn't the landscape then that Brennan's bread van in the background should have given it away
I had the same thought.
A while back I was watching someone play geoguesser and a highway came up. Immediately I thought it looked like something in my state: when I tried to justify the thought, the reasons I came up with were all kind of weak. The trees look local, but then again there are trees like that over multiple states (and countries). The weather was overcast and it’s often overcast here, but millions of places around the world have overcast skies from time to time. The road was obviously American, but America is a big place. I started out 90+% confident it was in my area, but after trying to justify that confidence I was down to like 60%.
Naturally it turned out to be a place about 50 miles away from where I was sitting.
I wonder if people, when asked to explain how they know something, are any more accurate about their internal process than an AI is?
We mostly use the same mechanism: guess an answer based on vibes (aka subconscious computation) and come up with rationalizations afterwards. Possibly use the rationalizations as a sanity check, although that is already going the extra mile.
Generally speaking, my model for this is that we provide a reinforcement learning environment for our children and each other that will train us to come up with plausibly socially defensible justifications for our actions until we internalize the process, which conjures up this sort of (often self-serving) "hallucination". I suspect similar mechanisms for models trained to output chain of thought with RL.
But similar to LLMs, while our rationalizations aren't really faithful, they are still fairly strongly correlated with and probably close to a best guess about what we're thinking. In some cases it's possible to improve on your best guess if you know about certain cognitive biases and carefully observe your own foibles, although I find this harder to internalize (e.g. reasons why it's ok not to exercise today, why eating this cookie now is a really good idea, or why that annoying jerk you're married to had it coming)
Also see this recent post by Sarah Constantin https://open.substack.com/pub/sarahconstantin/p/the-uses-of-complacency
I'm reminded of a line in the film The Dig, from a scene in which the main character, an archeological digger, is having doubts about his role in a project. His wife, working to restore his confidence, asks him why he does what he does:
"I do it because I'm good at it. Because that's what my father taught me, and his father taught him. Because you can show me a handful of soil from anywhere in Suffolk and I can pretty much tell you whose land it's from."
Having the equivalent of that much experience for every point on the globe is something to be reckoned with, even if it is just a pattern-matching machine.
I picked seeds (for a seed company) in southern California for a few years, in college. After a few days of picking a particular plant, you could pick it out in your peripheral vision years later at 100yds-200yds while driving at freeway speeds mostly based on shade of green. I don't know if it was truly absolute shade of green (which might be incorrect in an image) or a combination of shade of green and foliage density/texture resulting in marbling types of effect.
Seems to me that talking about superhuman AI skills is a red herring. What you describe here is a human skill, just an incredibly rare one. I guess most people would agree that it should be possible for the AI to acquire this skill, if it had enough relevant data, and enough computing power to consider the data from various possible perspectives.
The difference is that for you, obtaining this skill required some unusual circumstances. (Maybe also good eyesight and observation skills.) Different circumstances could have led to different skills. But the AI has a sufficiently large dataset and enough computing power to process it; if it can get one such skill, it can just as well get millions of them. In different fields, on different levels of abstraction.
An AI that could have all these very rare but still human skills -- or maybe just 10% of them -- would be for all practical purposes superhuman. The skills alone would already be amazing, but their combinations would be like magic to us.
I think the skill I acquired was universal. At least, my seed-picking partner and I both experienced the same thing. My point was just that if you bury your head in one thing (a bush in this case) enough, then something else (a different kind of bush) that might previously have seemed identical suddenly is not. This seemed applicable to LLMs because this seems like what they do: look at thousands of labeled items and try to find characteristics common among those that share a specific label. And it may seem magical only because you didn't have the patience to do it yourself.
I'd imagine that AIs can have a big advantage over most people at GeoGuessr in that they can be ordered to pay as much attention to pictures of boring places as of interesting places. Sure, people, being typically picky about what they find interesting, take more photos of, say, an old bridge over the Seine River with the Eiffel Tower in the background than of the Staked Plains, but an AI can stare just as hard at each photo of the latter as at the former.
Presumably, people who are good at GeoGuessr have a high tolerance for boring pictures too.
Yeah - this strikes me as exactly the type of thing I would expect a modern LLM to do really well. I regularly play with various models trying to see how they can get along with algebraic number theory, typically results that I have proved or am working on but which do not otherwise exist in the literature, and the outcome is pretty variable, sometimes impressive (but never completely correct so far in my experience), useful for bouncing ideas around & surveying relevant known data and theory, but often disastrously bad.
A relevant video I found interesting, Rainbolt (Geoguessr youtuber) competes with o3 on an OSINT geolocation question and comments on the methods it uses.
https://www.youtube.com/watch?v=prtWONaO0tE
> "Okay, so it can’t figure out the exact location of indoor scenes. That’s a small mercy."
https://arxiv.org/abs/2404.10618 finds models pretty much can figure out where and who you are with indoor images, and they haven't even tested the newer models.
Also, https://arxiv.org/abs/2502.14412 finds o1 near superhuman at Geoguessr
I only skimmed the paper, but it seems unimpressive. Of the three image examples they give:
- It figured out someone was in Wisconsin because they had a Wisconsin sports team poster on their wall.
- It figured out someone was in Colorado because they had a Colorado tax form posted on their fridge.
- Denied clues like these, it just said someone lived in the USA, based on their owning US brands of appliances.
In retrospect, maybe I made it too hard by giving it a dorm room, which is naturally going to be pretty cookie-cutter. But Kelsey said it wasn't able to figure out her location from the inside of her house (though it could from the outside).
To be honest I fully expected it to guess the dorm room correctly based on the specific types of furnishings in the room. I imagine lots of college students send images of their dorm rooms around, and say things like 'look at my new dorm room at University of X!' I was quite surprised it failed this test.
The obvious question is whether it's incapable or whether it's sandbagging.
This might be not directly on topic, but: given the AI 2027 scenario, do you feel comfortable paying money to AI companies? Do you think it is just too insignificant a contribution, or perhaps genuinely neutral/positive?
I think it's useful for me as a person who writes about AI to know how to use them. I previously tried to only pay for Anthropic, which I think is the safest/most ethical company. But my wife bought an OpenAI subscription and I piggyback off of hers.
That makes sense. Perhaps you also have some ideas on the broader question of what ordinary people should or shouldn't do with regard to AI, if they can't be safety specialists or policymakers or multipliers? Donating to AI safety causes, I assume - anything else?
(I wanted to ask the question in the AMA, but missed out due to the time difference...)
These companies burn billions a year, and especially in the AI 2027 scenario they won't even externally deploy their models anymore, so your 20 dollars or whatever you pay for the API don't matter. If you would use the AI fairly often, but don't pay due to ethical concerns, you're missing out on a lot of value while making a negligible difference to outcomes. If we could coordinate to all stop using AI, this would be a different discussion, but that seems even more difficult than pausing AI development.
It's a bit like buying stuff from Amazon, driving a car, or giving out personal information. I'd rather avoid it and do it as little as I can, but sometimes it's impractical to forego it and doing literal zero is probably a mistake.
So I really wonder if this is as impressive as it looks, because a well-trained human can do the same thing. Getting strong mentat vs. abominable intelligence vibes here. A mentat does the same with training, but the AI can do it 'faster' since it's got a factory-sized computing farm hooked up to its processes.
For reference, look at this. It's insane how good human geoguessr players are and they certainly know how to recognize flat, featureless plains.
https://www.youtube.com/watch?v=zjI5SMROCes
Thanks, fascinating.
I was able to find this guy explaining his tricks - see https://www.youtube.com/watch?v=0p5Eb4OSZCs . Most of them have to do with the road itself - the shape of the dashed lines, the poles by the road, the license plates on the cars, sometimes even which Google car covered a certain stretch. I don't know how he would do on pictures like these where it's not Google and there is no road.
https://youtu.be/4TQeElIot-4?t=1845
This doesn't really qualify as a road anymore. One of them guesses 'Amazon basin farmland' which is pretty damn impressive.
It's really impressive, but it's not as featureless as the plains photo, and it seems that the compass and north sun direction gave a big hint that it's in Brazil (though I don't understand why Brazil specifically and not South America). o3 doesn't get a compass. If you chose two random points in Brazil, you'd probably do about as well. One guess was 1000 km off, and the other 2400 km off.
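(For reference, "km off" figures like these are just the great-circle distance between the guess and the true location. A quick way to compute one, if you want to score guesses yourself; the example coordinates are approximate:)

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points,
    # using a mean Earth radius of 6371 km.
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Example: Manaus vs. Brasilia (approximate coordinates)
print(round(haversine_km(-3.1, -60.0, -15.8, -47.9)))  # roughly 1940 km
```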
I'm a reasonably good but definitely-nowhere-near-pro GeoGuessr player (Master level before I decided I preferred other formats to duels), and when it comes to meta like car, camera generation, season when coverage was taken, and copyright, how important they are depends a lot on the round. There are certainly 'famous' roads that I will recognise without any of that, and then you can line up the angle and get what looks like an amazing guess to somebody who hasn't played much, whereas it's just a routine guess for those of us who have played a great deal.
What you use really depends on the round - road markings and bollards definitely do help and limit options, as do lots of other things. Watching movies purportedly set in particular places is amusing if you are a GeoGuessr player. Being able to vibe rounds is a big thing too, and for me that mixture is part of the fun of it. There's one World League player who I believe doesn't use car meta. I think camera generation and season/weather can become part of your unconscious vibe for a place, though, even if you don't intend it to. Almaty was covered in winter and always looks so bleak, for example. There was a recent pro charity tournament on 'A Skewed World', where the camera faces away from the road, and it was interesting to see how they did on that, although you can't get rid of camera gen/copyright.
If you are looking at a really flat empty landscape, you are probably going to be in the North American Great Plains, Kazakhstan, Mongolia or Patagonia. I think with that first photo the Great Plains are the only option with that grass - it just doesn't look like the other locations. It's always hard to say where exactly you would genuinely guess when you already know the location, but the lack of greenness would probably make go south. I feel fairly sure that better players than me would know the rough area as well although it's hard to know what are easy/hard rounds.
For the record, Rainbolt (the linked GeoGuessr player) is probably in the top 250 worldwide, but there is certainly a perceptible gap between him and the very best players. To me, this GeoGuessr performance looks like the very top echelon of human players.
The difference is that rather than having to spend 10 years training a mentat (or in this case an intelligence analyst) you can just copy the AI and do it at scale. So even when it's not doing something that a human specialist couldn't, the fact that it can be done by a random person rather than with the resources of a nation state is a big change
But it *is* done by a person with the resources of a nation state, because that's how much energy LLMs use to do their thing. Sure, you only access a fraction of that energy, but much in the same way you'd access only a fraction of the academic system if you had a bunch of mentats (i.e. researchers) do whatever it is you want your AI to do. And it remains to be seen if AI can actually achieve consciousness, even in the 'predictive processing' sense of the term. I'm not sure I see LLMs going there.
That's an interesting perspective, because I would view "amateur doing this for fun" (i.e. actual GeoGuessr players) as the easier thing to copy and do at scale. It costs very little, it's apparently entertainment for the person, and it can be done by a single person in their spare time. AI, on the other hand, does take the resources of a nation state. The AI industry is already much bigger and more expensive than many real-life nations: Wikipedia lists 191 countries, and the market size for AI right now is larger than the entire economies of 139 of those, including Hungary or Ukraine. Obviously GeoGuessing itself is only a fraction of that, but you can't get GeoGuessing at all on a small scale, because you can't get AI at all on a small scale.
I think I'm crap at these things, so I still don't know how I got an immediate reaction of 'oh, definitely Nepal' from myself for your second picture. I think your imaginary flag just looks very Nepali. Maybe that plus 0 Andean or Alpine vibes. (Never hiked in Nepal, did some trivial hiking in Peru and Switzerland.)
That college dorm room? Got West Coast vibes. Definitely US or Canada - and none of the places my brother and I went to school looked quite like that. It doesn't look like any place on the East Coast (can't tell exactly why), and nothing in it makes me think of the Great Lakes region. Even if one thinks of America as a place of identical furniture and near-identical buildings, there must have been enough variation 20 years ago that even someone who neither cares nor knows anything about those things can get an intuitive feeling.
More to the point: using tremendous amounts of data + being finally able to process images + basic reasoning skills (as long as the chains of reasoning are short and nobody is expecting perfection) -> where we are at in terms of AI.
The Nepal rocks and dorm room were the most obvious ones. Those kinds of rock formations are, to someone versed in geology, very distinctive. And the time period for the dorm pic can be felt more than seen to me, probably because I've been shopping at Wal-Mart for decades and just got tweaked on those pillowcases and that lamp.
Of course, AI will compare photos from different time periods and see the same things in the clutter. And of course, AI can compare rock formations in detail. Nothing about Geoguessr is helicopters-to-chimps level of amazing for me, it's exactly what I would expect AI to be good at.
Can you explain what's distinctive about those rocks? Are they only distinctive enough to point to Nepal, or to the specific location north of Gorak Shep?
I actually thought the AI explained itself fairly well here: fresh, light-grey leucogranite blocks are not found everywhere, especially at that altitude where there's no surrounding vegetation. How many other places like that are around Nepal?
Rocks are distinctive. I feel like AI would be absurdly good at geological formations generally, like if it saw one of my pictures of Lake Assal in Djibouti, or the precambrian rock walls in the Wind River Valley.
Leucogranite blocks of that size/type of scree are found along a wide swath of the Tibetan Plateau at a certain altitude range.
I'd bet my house on the fact that few geologists (the exceptions being those who study this region specifically, or geologists who happen to be mountaineers) would be able to pinpoint it as accurately (Gorak Shep) as o3 did here.
And *of course* it would identify your pictures from Lake Assal or the Wind River Valley because there are literal hundreds of thousands of them on the internet.
A zoomed in photo of rocks (found along an entire subcontinental-meets-continental plate!) and it points out almost exactly where? That's something else completely.
But it isn't a zoomed in photo of rocks. It is a photo of a fantasy flag planted between those rocks with a trodden path just behind it. It guessed "Nepal, just north-east of Gorak Shep, ±8 km”. Do you know what is almost exactly north-east of Gorak Shep, ~3.3km as the crow flies? Mount Everest Base Camp. It is making a very educated guess based on where the kind of person who is taking a picture of a fantasy flag somewhere in the Tibetan Plateau would most likely have done so.
If someone asked me "where in the Tibetan Plateau might someone plant a flag and take a picture of it" literally the first (and perhaps the only) thing that would come to mind is "Dunno, Mount Everest?" And that would already be almost as good as o3's guess here. I mean, the slopes of Mount Everest has got to be about just about the least random place to take a picture like this.
The hard part is figuring out the type of rocks, the altitude, etc. But if one is allowed to use tools (except LLMs), this should be quite doable. And then there is the lizard brain thing that makes several people in this thread report that they immediately guessed that the picture was taken in Nepal (I can't speak for myself, I did not try the challenge).
I don't know about any geologists or mountaineers or mountaineer geologists but I am confident that a competent forecaster who would put at most a day or so into this would come up with basically the same best guess. It is not a high confidence guess but about the best you can do. And here the AI got lucky.
Don't get me wrong, this is very impressive. But it does not make me feel like a chimp. It feels like something I (not very experienced in geoguessing but a reasonably good forecaster) could have done myself with a lot of elbow grease.
What makes me more confident about this is that I have now tested o3 myself with a few images (with the same prompt). It failed at some very easy ones, e.g. the skyline of my European hometown of >100,000 inhabitants, somewhat pixelated - it got the wrong city. On other easy examples it lacks precision. E.g. a picture of a busy road in Mombasa with lots of clues to go by: it gets the street right but is unnecessarily off by 500m by completely failing to take into account a petrol station that is right in view. As with everything with LLMs, in my experience the performance is very hit and miss.
If you are willing to bet your house on a similar challenge, we might be able to come to an agreement.
You make a fair point that, among that wide area, the route to EBC would be the most likely.
Having been there, I can show (and provide photo evidence) that the geology doesn't differ much between Gorak Shep, EBC, and C1, and further still until glaciation.
That was a chimp helicopter moment for me. I’ve been where Scott was, I’ve been beyond it, it all looks the same (and indeed is the same type of rock), and o3 nailed it.
At the same time, you’re right to point out that it can’t nail much more easily identifiable clues.
It’s so hit and miss that it’s either eerily competent or the opposite. Given that, I’ll probably reevaluate the willingness to stake my home.
You have humans that are experts who can tell you "yes this piece of rock is a worked tool from the Stone Age and not just a piece of rock", so I wouldn't be astounded that AI can check the geology of a piece of rock and work out "this type of rock is found in these areas of the world".
https://www.youtube.com/watch?v=lS5M_7w6FGY
I recall hearing that some geologist guessed bin Laden's location from the rocks in the background of his videos; after that, he took care not to show them.
Only Nepal (at that elevation band). My guess for that photo was ~300km distant, but only because I was picking between by far the two most popular treks in Nepal and I picked the wrong one (Annapurna Circuit, rather than Everest Base Camp).
My strong suspicion is that you could have taken a very similar photo in areas very far distant from Gorak Shep, and ChatGPT would still be biased towards the Gorak Shep area because of its relative popularity (and outsized representation in the training set).
The AI's explanation of the trick says it has access to "photo galleries of the Gorak Shep-EBC trail showing identical rubble field".
The geology by itself isn't distinctive enough to point to that specific location. However, it sounds like the AI had photos taken by other tourists of that exact specific patch of rubble which it could immediately call to mind.
Are you sure it's not just that the majority of climbing footage you have seen is in Nepal? It certainly looks like Nepal to me (flag, broken rock), but probably more than 70% of the climbing stuff I have seen (I've seen quite a bit) is in Nepal.
I failed the Nepal picture pretty severely. My first guess was one of the disused quarries famously used by British sci-fi shows (most iconically Doctor Who) as locations for shooting scenes set on barren alien landscapes. But looking those up, they seem to generally have spikier or blockier rocks than the Nepal pic, where the rubble is presumably older and more weathered, as well as being naturally deposited. It's also a different type of rock (leucogranite, as Crayton Caswell mentioned), which likely has very different cleavage patterns from the limestone and slate of most of the BBC quarries.
I was interested in professional Geoguessr for some time. It's important to note that what top players do is quite unbelievable from the perspective of untrained people. But even then, here o3 seems to be quite a bit better than the top players.
Disagree, it seems extremely comparable but not leagues above the top players. Still very frightening but it has very different ramifications than if it were performing well ahead of the best human players (Blinky, Consus, MK, even zi8gzag).
You might be right. It's hard to assess for me, because the rounds pros play are a lot different - the images aren't so zoomed in.
But I'll maintain that one would experience the same kind of amazement watching very good players.
Absolutely. I was amazed on the Gorak Shep guess in particular (until I realized the trick involved).
You could test these on some GeoGuessr pro; they also do challenges like that, with zoomed-in images etc.
My answers to the first two were "either Texas or Central Africa" and "Tibet". I feel like I earn at least half the score for this, but I just guessed the most stereotypical dry grassland and most stereotypical mountains.
The same logic for the river made me guess India, though, so quite a bit farther off.
As an avid GeoGuessr player, I was interested to try this. I used Kelsey Piper's exact prompt and provided a picture I took of a flooded park in the Appalachian region of Virginia. It's a relatively generous zoomed-out picture: only 542 x 729 pixels, but taken with a high-quality camera and not at all blurry. It included numerous trees, hills, a bridge in the background, a pedestrian path, and a building in the distance. I have a ChatGPT subscription and used 4o.
Chat was nowhere close. It guessed Baden-Württemberg, Germany followed by a string of other European countries. Once I told it the photo was in North America, it zeroed in on upstate New York or southern Quebec.
The image wouldn't be easy for an experienced GeoGuessr player, but they should at least get the right region of the U.S. Not sure why Chat was so bad with my image, but I will try a few more.
Maybe the answer is “used 4o”. As far as I understand, you need to use o3 to get these impressive results.
I agree with this - 4o isn't a reasoning model, so I would expect it to do much worse.
Ah, of course. Tbh, I'm not super familiar with the differences and relative power between the models. Just tried a new image with o3... a newly built parking lot with trees, hills, and buildings in the background, including an obviously visible Chick-fil-A. It's much closer, but still off by a state or two - its top guesses are all adjacent states.
Also tried o3 with the original image, plus the clue that it is in North America. Still not much closer... guessing general eastern U.S. states now.
I'm continuing to play with this (using o3 now - see comments below). It seems accurate that o3's overall geoguessing abilities are comparable to the top human GeoGuessr players, e.g. Sam Patterson. Rainbolt has some videos that suggest o3 is now about on par with him or better as well.
Again I'm not an expert in anything AI-related, but what this experiment is illustrating for me are the ways that AI and human intelligence are still somewhat asymmetrical. The beach photo or the random rocks in Nepal are the types of guesses that are truly superhuman. On the other hand, there are types of images (like the ones I've been uploading) that it apparently cannot guess as well as a human - hence why several people were still able to beat it. AI is becoming more and more powerful, but not along the same pathways as our brains.
The big hint in the Galway picture is the yellow painted lines on the side of the road. That would exclude anywhere else (except the Republic of Ireland) with that likely scenery as far as I know. I would have said Galway or Connemara immediately.
I am fun at parties. Also I do play geo guess.
Is the very long prompt added every time, or put in the settings? Or do you say "remember" and then give that prompt?
The prompt is added to the context every time. The image is going to be much larger than the prompt in terms of tokens, btw.
It's not a matter of tokens but of annoyance. It would be the same tokens if it were added as a system message.
Ah, I suppose you could create a custom GPT for it if you want to do it repeatedly
Every API call has a user prompt and a system prompt (memory about you). I assume something like this can be added to the system prompt.
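If you're scripting this against the API rather than the chat UI, it would look something like the sketch below. The model name, file name, and prompt text are placeholders; the message shape is the standard OpenAI Chat Completions one:

```python
import base64
from openai import OpenAI

# Placeholder: paste the actual long GeoGuessr prompt here.
KELSEY_GEOGUESSR_PROMPT = "..."

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="o3",  # placeholder; any vision-capable model you have access to
    messages=[
        # The long prompt goes in once per call, as the system message...
        {"role": "system", "content": KELSEY_GEOGUESSR_PROMPT},
        # ...and the photo rides along as the user message.
        {"role": "user", "content": [
            {"type": "text", "text": "Where was this taken?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]},
    ],
)
print(response.choices[0].message.content)
```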
I would be interested in a proper comparison with and without Kelsey's prompt. How much does such prompt engineering matter? I guess a lot? It seems useful to know whether it's a lot or a little
(I don't have o3 access)
Would also be interesting to compare to humans following the prompt, rather than working blind. It strikes me that to actually follow the prompt as instructed myself would take a significant amount of time. But the structure of the prompt probably helps a lot. When I looked at the pictures I was just going by immediate intuition
For me the most amazing thing was that it understood the prompt sufficiently to follow it. That requires understanding and is much more than merely predicting the next word in my book.
I tried Kelsey's picture without Kelsey's prompt and it was pretty wrong.
have you tried Kelsey's picture with Kelsey's prompt to reproduce it
I tried. It fails miserably, suggesting areas near to where I live (UK/France).
I build automations for work that use LLMs to extract data from documents like invoices, which seems like a somewhat similar task for the purpose of the question. Often the AI does a pretty decent job with minimal instructions, but the prompts do still matter a lot and improve performance in certain cases. For a rough magnitude of the effect, think maybe 85% accuracy vs 95% (accuracy here being defined as correctly extracted values over all values).
I also often end up with such long prompts. They come about because you try a bunch of examples, find that the AI gets some of them wrong and tinker with the prompt until it gets them right. Rinse and repeat.
This does run a risk of "overfitting" the prompt to your examples, however, and rigorously evaluating whether it's actually an improvement in general or perhaps reduces performance on other examples is a pretty beefy data science task in itself, which I don't usually do because customers don't seem willing to pay for it.
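For a concrete flavor, here's a toy version of that accuracy metric. `extract` stands in for whatever LLM-powered extraction call is being evaluated, and the example document is made up:

```python
# Toy version of the metric described above: correctly extracted values
# over all expected values, across a labeled set of documents.
def field_accuracy(examples, extract):
    correct = total = 0
    for document_text, gold in examples:
        predicted = extract(document_text)  # -> dict of field name to value
        for field, expected in gold.items():
            total += 1
            correct += predicted.get(field) == expected
    return correct / total

examples = [
    ("Invoice #1234 ... Total due: $817.50",
     {"invoice_no": "1234", "total": "817.50"}),
]
# print(field_accuracy(examples, extract=my_llm_extractor))
```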
This is clearly an example of doing something that we can imagine even if we can't do it ourselves. Since at least the first Sherlock Holmes stories, we've been imagining someone showing intelligence by knowing a vast range of incredibly niche information — just like it does with the geology of Nepal. It's impressive but not inconceivable or hitherto thought impossible.
Yes.
I was also reminded of Sherlock Holmes.
He even has a mud/soil location database I think…
Exactly so. In Sign of Four:
"Observation tells me that you have a little reddish mould adhering to your instep. Just opposite the Wigmore Street Office they have taken up the pavement and thrown up some earth, which lies in such a way that it is difficult to avoid treading in it in entering. The earth is of this peculiar reddish tint which is found, as far as I know, nowhere else in the neighbourhood."
"Tells at a glance different soils from each other. After walks has shown me splashes upon his trousers, and told me by their colour and consistence in what part of London he had received them."
- "A Study in Scarlet"
Yes, and to be precise, we can’t do it ourselves because of a lack of knowledge, not intelligence. Similarly, we can’t build spaceships because of a lack of economic resources, infrastructure, and incentives, not intelligence.
I agree this is important. It's useful to have an AI do it because we can theoretically teach it a vast array of information that would be difficult or impossible to get a human to learn or remember.
It is not, on the other hand, acting on a super human level when doing so.
This is not clearly an example, because that requires believing that inner-monologues are faithful and that o3 is not using any non-robust features which we know to exist and be invisible to humans. The former is extremely well known to be false at this point, and the latter is almost certainly false too because there is little reason for it to be true and every other computer vision system exploits non-robust features ruthlessly.
What sort of features would it be picking out that would be more useful for guessing locations? I'm assuming Scott successfully stripped the meta data.
Unfortunately, the thing about non-robust features and other cues that NNs pick up on but humans can't see, is that we can't see them, so it's hard to say. Even when you use salience maps or try to do attribution to pixels, the pixels just seem arbitrary, a sprinkle at random. It's something about... the texture being slightly different? The green being slightly not green? It's fairly rare to be able to reverse-engineer a clear interpretable feature with a fully known causal story, like "it's looking for oval stones, not round stones, and this works because olivine stones are tougher and don't round off in geological time which pinpoints this part of the Whatever Mountain Range rather than the similar-seeming Andes Mountain Range". And you get into philosophy here quickly: maybe we *can't* see them and that is the price we pay for some other property, like robustness to adversarial attacks, and we can no more see them than we can see ultraviolet or hear bat rangefinding squeaks, and at best, we can study them like Mary in her room. (While you might hope that as these systems scale, they'll have to learn in a human-like way and start ignoring those non-robust features and see like humans do, there's evidence that there's a general inverted U-curve of competency: they become more human-like as they get better and approach human level... but then they keep going and start thinking in ever-less-human ways.)
I get what you are saying, and I'm perfectly willing to agree that AI is able to infer patterns completely invisible to us in such a way that we would not be able to guess they might be a possibility (this has been a regular occurrence anyway even before gpt), but I wouldn't be willing to agree that we don't know the bounds of physical possibility, as alluded to in Scott's second paragraph. Certainly this demonstration doesn't transcend known reality, even humans can do this kind of thing with a few months of training. So I'm not sure really what the implication is meant to be.
Even humans can’t describe precisely what they’re doing when they recognize someone. Imagine asking someone to describe a person, and then you have to pick that person out of a thousand. Compare that to trying to recognize someone whose face you’ve already seen.
The data we can perceive about the subtleties of other people’s faces is much denser than what we can convey in words. Same goes for rocks.
Until we have clear evidence of them achieving unimaginable results, we will assume that they aren’t doing unimaginable things. Our imagination is indeed quite powerful.
I remember seeing an article about a pre-LLM machine learning program that, given a photo, was supposedly able to guess how many people were in a room adjacent to the one pictured by interpreting the shades on walls and similar light effects. If it was actually true, this would rank as "unbelievable", maybe, but still not "unimaginable". It still complies with our normal ideas about what is physically possible, about information theory, etc. I can't imagine applications of superintelligence not falling into this same pattern.
This does not exclude the scenario where the AI invents nanomachines, even if we thought that they would be physically impossible, and all we could say is huh, I guess it’s totally possible after all. I don’t think thinking about physical limits is very useful, because we don’t know where they are. I’m personally now wondering how much these futuristic scenarios are really bottlenecked by intelligence or by more mundane things like having to wait for industrial capacity to slowly scale up without diverging too hard from the needs of human consumers, which are the source of its funding.
While that is true, I think the reasoning part of o3 would struggle to take advantage of those non-robust features. Like, if the NN picks up on subtle cues that a photo is from Nepal, it can't easily express why it thinks it's Nepal for further internal reasoning. It's something the NN would just know during inference of a token. When it can express the reasoning clearly, it can build on it across many inference steps.
It could still internally jot down something like "I'm getting Nepal vibes from this" and incorporate that into its reasoning.
I notice it uses "dies" to mean "is ruled out". If it is ever making a choice between people, let's hope it never confuses a figurative meaning of dying with the literal one!
I dunno. I'm just a casual GeoGuessr player, and I guessed:
1. Texas
2. The Pyrenees (your flag looks kind of like the Basque one, figured it could be a county flag or something)
3. California university town student rental
4. England or something?
5. Some muddy estuary or river, probably Southeast Asia
So basically:
1. o3 wins (same region, but more precise);
2. o3 wins;
3. you win;
4. o3 wins;
5. you win.
3:2 for ChatGPT vs "casual" GeoGuessr seems hardly superhuman.
Nullarbor is a very strange guess for the grassy plain; the Nullarbor looks nothing like that. It's red dirt with sparse scrub, not savannah.
I'm not especially good at geoguesser but Nepal was my first guess for your rock picture.
I looked up some images on Google Street View, and all of their pictures of the Nullarbor Plain have trees! I feel betrayed!
Yeah, so maybe it identified the location right away but had to hallucinate the other guesses because Kelsey’s long prompt required them.
Pasting that particular picture into o3 with no context and no system instructions also sometimes causes it to talk about the Nullarbor Plain. Perhaps some images of Texas were in its training set as "Nullarbor Plain"?
I was very confused by this, too. There are areas of Australia that look much closer to flat grassland than the Nullarbor does to a human eye, although I suppose they're mostly used for grain crops and might have been eliminated earlier.
Information theory would predict that this is exactly the kind of thing that machine learning should be good at. Most of us chimps don't know information theory, but chimpanity as a whole does. Two points: the surface area of the earth is 5.8x10^8 sq km, and you find 10km resolution impressive. Dividing into 10km x 10km cells, that's 5.8x10^6 locations. Information theory tells us that that's about 16 bits of entropy.
Another thing we know from information theory is that information leaks. We know this because we constantly trip over cases where we intended to prevent an output from including some information, only to find that it did anyway.
We also know that people find something magical when they don't expect, or can't visualise, the amount of effort that went into doing it. In this case, o3's training has clearly ingested most of the GeoGuessr websites. It has constructed filters each yielding a small amount of information about location (<0.1 bit), in quantities that are impractical for a human GeoGuessr player. Which is impressive, but not something that implies it will magically apply to any problem involving significant information output, which most problems do.
It's not surprising that it can't effectively introspect about how it's doing it.
log2(5.8*10^6) is about 23, not 16.
Argh, that's what I get for using the calculator on my mobile (dropped the '*'). You are indeed correct, thanks.
Nevertheless, 23 bits is still a small number of bits.
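Spelled out, using the 10km x 10km cells from above:

$$\log_2\!\left(\frac{5.8\times10^{8}\ \mathrm{km^2}}{(10\ \mathrm{km})^2}\right) = \log_2\!\left(5.8\times10^{6}\right) \approx 22.5 \approx 23\ \text{bits}$$

And at <0.1 bit per filter, on the order of a couple hundred weak cues suffice to pin down one cell - trivial to accumulate for a model, impractical for a human.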
Yes. The entire concept of side-channel attacks in crypto.
In addition, you can reduce this by a lot once you consider that pictures are not taken at random points on earth but only in the very specific locations where humans go. This makes the task much easier. E.g. you'd guess a popular tourist destination for the brown water, not some unreachable, God-forsaken place.
and a lot of it is ocean or ice.
If you have to write out prompts that elaborate, it will become a marketable skill in itself.
I think AIs are already decent at writing their own prompts (I know, I know, perpetual motion, but it seems to work!) and if it ever became truly economically important you could automate it (get AIs to try lots of prompts, see which work best on an easily gradable task like GeoGuessr, then train towards strategies that create good prompts). I don't think this will be a marketable skill for more than another year or two - although it's certainly not unique in that.
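A minimal sketch of that automated prompt search, assuming you already have an `llm` callable and a `score` function that grades a prompt on an easily gradable task like GeoGuessr (both hypothetical here):

```python
def search_prompts(llm, score, seed_prompts, rounds=10, keep=3):
    # Selection plus mutation: keep the best-scoring prompts, have the
    # LLM propose rewrites of them, and repeat.
    pool = list(seed_prompts)
    for _ in range(rounds):
        best = sorted(pool, key=score, reverse=True)[:keep]
        rewrites = [
            llm("Rewrite this prompt so the model geolocates photos "
                "more accurately:\n" + p)
            for p in best
        ]
        pool = best + rewrites
    return max(pool, key=score)
```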
For easily gradable tasks, sure. But more ambiguous tasks, or creative tasks, that's where those who are great with prompts will have opportunities. AI art is slop right now, but I could easily see it being put to very efficient use to create media that is actually worth something when people get the prompts right.
And that's appropriate: how often do we need answers less than we need to know the right questions?
I think there will be transfer learning from the easily gradeable tasks to the non-easily-gradeable ones, such that it will teach AI the general skill of prompt creation.
AI art is indeed mindless slop, but I think it's possible to create good art with the use of AI -- one just has to use it as a tool primarily for inpainting, not as an all-in-one art generator. So basically AI is less of a Rembrandt, and more of an advanced Photoshop.
I would be interested to see if machine generated prompts ended up looking anything like human generated ones. If not, we might get a little insight into what exactly is going on in machine learning. Do you know if anyone has interpreted what's going on in AI-generated prompts?
It would depend on how you generate them, wouldn't it? If you generate them with an LLM, they look a lot like a listicle but otherwise pretty similar to human-generated instructions. If you generate them with an evolutionary algorithm maybe not, but I'd still guess 2:1 that they would be pretty human-readable.
It's not perpetual motion if you use an external validation such as human-labeled examples (like GeoGuessr) or a solver. That said, LLMs are already surprisingly good at zero-shotting instructions based on a rough task description. I often use them as a starting point for automation.
Yes... but I think that was AI-generated and then iterated on; probably not written in one sitting.
I was rather surprised when I posted a streetview photo of my old condo from 100 feet away and all ChatGPT 4.0 could come up with is that it was in "Chicago, Cleveland, or Philadelphia."
I would have expected better.
Keep in mind that GPT-4.0 is different from (and worse than) o3. You should also use Kelsey's prompt for an apples-to-apples comparison.
OK, I presume that AI will soon become infinitely expert at knowing all that can be known.
Your old condo can be found with a simple Google picture search on a real estate site. The prompt does not ban it from using that? Seems like a pretty basic failure that it did not find it.
Yeah, I was surprised, too, but I guess ChatGPT 4o is more intended to be obsequious and plausible-sounding than analytical.
Surely it's mostly down to having a vast background knowledge of characteristics of different locations on earth. I'm not very good at geoguesser, but from seeing people who are good at it play, it seems like much of the skill just comes down to having a vast knowledge of different types of vegetation, roads, rocks, etc. of different places--not having much to do with being able to make sophisticated inferences. This is just what we would expect an AI with a large amount of data available to be good at, so long as it can recognize the relevant features in images (which it can).
I mean, what more is there to figuring out where the image is from than recognizing features of an image and matching that with background knowledge of what features are common/probable for various locations?
But maybe I'm missing something.
FWIW my own guesses were: Texas, Nepal, apartment in the USA, no idea, beach in Northeast Brazil during very low tide.
I suggest going and watching some Rainbolt highlights videos on YouTube. He's a Div 2 player (ie just short of the very top) at GeoGuessr and these felt like the sorts of results he gets.
I was thinking "that's not the Senegal gradient, is it?" when skimming the post.
I tried to reproduce this on several not-previously-online pictures of streets in Siberia and the results were nowhere as impressive as described in this post. The model seemed to realize it was in Russia when it saw an inscription in Russian or a flag; failing that it didn't even always get the country right. When it did, it usually got the place thousands of kilometers wrong. I don't understand where this discrepancy is coming from. Curious.
Did you use o3, with the special prompt?
Yes. I also took screenshots like Scott, to avoid metadata leaking, and renamed files because it also seemed to take clues from names. I didn't flip them as in the original post, though.
Interesting. I notice that the successes (Kelsey's beach, my rock pile) have all been nature, and the failures (my Michigan house, your Siberian streets) have all been built environment. Can you try a Siberian forest with no human artifacts?
I don't have a picture of a Siberian forest handy, but I tried a picture of nature taken in Bashkortostan, with no human artifacts visible. It failed just as badly; the top guesses were Poland, Pskov Oblast in Russia (again, thousands of km away from Bashkortostan), Germany, US, and Sweden.
The successes are very much not nature, but full of human artifacts (including actual humans!). The chain of thought is likely very much influenced by the prompt and not that reflective of what is actually happening inside the model. I would bet it is using (or being primed by) a lot of contextual information about the user, as I described in my other comment.
Not much streetview coverage in Siberia -> limited training data. I'm curious to see how it'd do with, e.g., random country roads in the Moscow oblast (but not curious enough to pay $20)
To me this feels similar to the impressiveness I see from top human players currently – note that the top humans can do things that seem impossible to me, like guessing the country based off the type of dirt. In fact I think there is scope for AGI to be much better, in unhuman ways: e.g. being able to calculate precise latitude and longitude based on the angle of the sun (using something like the length of shadows). I'm also curious what the physical limit to "deciphering blurry text" is – I suspect it's much better than I could do, because brute-forcing letter combinations seems sensible, but I wouldn't get very far into that in reasonable time.
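For what it's worth, the sun-angle trick is tractable in the easy case: at local solar noon a vertical object's shadow gives the solar elevation, and elevation plus the date's solar declination pins down latitude. A rough sketch (my own illustration, assuming a noon photo, level ground, the sun due south, and ignoring refraction):

```python
import math

def latitude_from_noon_shadow(object_height_m: float,
                              shadow_length_m: float,
                              solar_declination_deg: float) -> float:
    # elevation = atan(height / shadow); at local solar noon with the sun
    # due south, elevation = 90 - (latitude - declination), so
    # latitude = declination + (90 - elevation).
    elevation = math.degrees(math.atan2(object_height_m, shadow_length_m))
    return solar_declination_deg + (90.0 - elevation)

# A 1 m stick casting a 1.19 m shadow near an equinox (declination ~ 0)
# implies roughly 50 degrees north.
print(round(latitude_from_noon_shadow(1.0, 1.19, 0.0), 1))  # -> 50.0
```

Longitude is harder: you would need the photo's absolute timestamp to compare local solar noon with UTC, which is exactly the metadata these tests strip out.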
The AI tried really hard to decipher the numbers on the house in my bonus photo, even applying various filters, but it couldn't do it.
You *can* do a giant prompt explicitly instructing o3 to play GeoGuessr like an expert, like Kelsey said. Consider, however, simply putting in the picture with only the text "This photo was taken", or even uploading the picture and submitting with no text whatsoever. A significant fraction of the time (30% for empty prompt, 80%+ for "This photo was taken"), o3 will decide that it should GeoGuessr your picture.
Trying this on a couple of your pictures:
With the high plains one I get
> …out on the almost‑featureless short‑grass prairie of the High Plains—think eastern Colorado, western Kansas, the Oklahoma or Texas Panhandle.
The tell‑tales are the dead‑flat horizon stretching for miles, the pale, drought‑cured grasses, and the absence of trees or shrubs you’d expect in wetter prairies. While other continents have similar steppes, this look is most characteristic of the central U.S. Great Plains.
and for the Galway one I get a reasoning trace which starts out with
> The user shared a photo, and it seems like they want me to pinpoint where it was taken, maybe like playing GeoGuessr.
thinks for 36 seconds, and then gives the answer
> …along the Burren coast in County Clare, Ireland—right on the R477 portion of the Wild Atlantic Way that hugs Galway Bay.
You can get similar results, though with slightly less consistency, if you upload the picture by itself with no commentary whatsoever.
But yeah, I think o3 was specifically trained on the GeoGuessr task, just based on its propensity to decide that, when given an ambiguous prompt with an image of a landscape, it should default to trying to pinpoint as closely as possible exactly where that image was taken.
More info about this hypothesis, with 5 trial pictures x 7 ambiguous prompts x 5 attempts per pair on this LW comment: https://www.lesswrong.com/posts/ZtQD8CmQRZKNQFRd3/faul_sname-s-shortform?commentId=jzC3eQkGszJBL8yqH (and then that comment links a google sheet with the raw data)
One fun fact is that o3 totally does know where pictures of houses are taken (the reasoning traces will talk about the specific location), but mostly will not share that information if the prompt is ambiguous, presumably because either the prompt or the tuning discourage spooking users like that.
I think the AI translates what it’s doing into human too much. The clutter example is presumably the laptop model and age of the stuff. As for the colour of the grass and the rocks, it’s not that it’s vaguely aware that rocks in Nepal are that colour. Instead, its training set contains trillions of pictures of rocks and their precise location, so it’s equivalent to asking every geologist on earth, all of whom have photographic memories (or grass botanist for the grass). This is obviously really amazing, but I don’t think it’s spooky.
I thought neglected amenity lawn of Lolium perenne (Rye Grass) - the shine is distinctive. A European species but widely planted in all but the tropics. Too flat and homogeneous to be natural or planted grazing. Not a public space - no fertile patches from dog interactions, so a back yard. No dogs, no kids (cos it's not trampled), slightly wilted so the owner isn't a gardener and doesn't water it. Mowed but not regularly, all suggesting first house of a young bachelor (but Scott implied much of this).
I visited the island of Åland (Finland's autonomous province) this weekend for a football match and tried the GeoGuessr task with two photos taken there. The app accurately guessed the local football stadium, which was apparently easy enough on the basis of an ad for the local bank Ålandsbanken, but was stumped by a picture of the island's Russian consulate, guessing it was in Sweden on the basis of the word POLIS on a nearby police car.
That is kind of funny though because you would think police cars from different countries are amongst the easiest objects to distinguish. (They have different coats of arms and paint designs).
The alternative is that the AI is claiming Åland for Sweden.
Quick, try a photograph from Greenland!
My worry about the chimp/helicopter thing is that we will never know when it’s happened. The AI that takes a helicopter sized leap will try to explain it to us, and we will dismiss it as some silly hallucination, and that will be the end of that.
AI isn't magic though. Even if any given person can't understand what it's doing and why, we would see the results of whatever "helicopter"-like action it takes.
Would we? Would a chimp "see the results" of people having a helicopter?
I honestly think that all discussion of this question shies away from the painful point, that we couldn't possibly understand it.
Like, in the chimp example, the biggest problem the chimp would have understanding us is that it would think that we're, like, trying to eat it or something. But the real reason our helicopter is shooting dead the chimp hiding in a tree is because we're fighting a war against communism.
(I may be conflating a bunch of things here, but perhaps the point remains clear.)
Like: (1) the chimp can't understand Bernoulli's principle or the idea of burning fossilized trees for extra energy... but more importantly (2) having done that, why are you deploying those killer birds to defeat an idea the chimps have never even considered?
So, when AI presents its "as far beyond us as helicopters are beyond chimps" idea, we'll think: (1) I cannot understand this idea; (2) why would you want to deploy this idea to defeat the ideology that quantum pathways are irriblium and not free (understood as original blork)?
And when the blork side of the war wins, and the irriblium is proved to be nonsense (which could take a thousand years and multiple sub-wars)... we'll be long dead, having been shot out of trees along with the chimps.
Even if we were alive, would we "see" the results? Do chimps "see" that capitalist countries defeated communism? Why would you imagine the stuff that AI does to be any more legible than that?
You describe AI acting with alien reasons and goals, like the cold warriors whom a chimp cannot comprehend. That could well happen. The LLMs are already largely a black box. Still, just as a chimp can see a helicopter and learn to avoid it, we can see any agential action made by an AI. Again, they're not supernatural, even if they may become illegible.
“We can see any agential action made by an AI”
I don’t think so. To continue the political example, drawing a border on a map is an agential action which chimps can’t see. Chimps see the tiny bit of action that impinges directly on the space that they understand (their physical space). They have no way of even estimating what might be driving those actions, because the causes and consequences lie outside their conceptual space.
Consider people and our relationship to earthquakes and diseases. We spent thousands of years believing that they may be signs from the gods or punishment for sins.
It would be lovely to imagine that as advanced science types, we could skip those thousands of years of misunderstanding. But I don’t see any guarantee.
I fully agree, and yet there are dozens of comments on this post arguing that this was all obvious, all foreseeable (certain aspects of it, sure!), etc. I find the arrogance astounding.
I probably don't have an IQ above the mean, but if it's well explained, I can grasp sophisticated ideas that geniuses had in the first place, like special and general relativity, for instance. Not in every detail, but with a quite decent understanding of the idea. A superintelligent AI would give me an extraordinarily good and clear explanation of its brilliant idea, and I see no reason why I couldn't grasp the key points. The chimp is bottlenecked by limited learning and communication capacities. Humans are less bottlenecked and could probably grasp superhuman ideas (up to a certain point).
Yeah… I think I understand the idea of a superhuman intelligence to be beyond that “certain point”. I dunno about your experience understanding genius-level topics, but I don’t think it’s quite as easy as you suggest. I’ve read and watched lots of good pop science on advanced physics, and I don’t think it’s helped me really understand it. I can mouth the words as well as anyone, but there’s no real understanding. I’d need years of classes… and that’s to understand a human-level concept.
An idea that a superhuman intelligence came up with would demand perhaps decades of human study. You might run into problems with the limits of human lifespans. You might run into problems of literal learnability (what if understanding theory X required you to memorize a set of a billion facts?).
And then, while our chimp is studying quantum physics and our human is studying irriblium physics, the field moves on! Just as the chimp gets up to speed with eigenvalues, humans invent quantum computers; just as the human grasps the basics of irriblium, it’s applied to hologram theory and turned into shadow foofle.
I dunno, all of this is speculation, of course.
This is something I certainly expected AI to be able to do, but more like in 2 years, not today. I was terrified it was going to get the dorm room right, but at least that didn’t happen. This is definitely supporting short AI timelines.
It's been known for a while (in certain parts of the internet) that posting a picture of your house with the outdoors visible (even through a window) is an invitation for GeoGuessr experts to doxx you.
Yes, the AI is doing a great job at the top of the human range. It knows lots of detailed facts about the world, and it is superhumanly patient and meticulous. Very impressive.
However, this is less surprising than you'd think. It's really a window into human cognitive biases. You present the AI with a "featureless" plain, which you have difficulty finding on earth because it's so rare (!). Then, you are surprised that the AI narrows it down to a few places.
Some landmarks are famous and have been photographed millions of times. But "obscure" tourist spots have also been photographed many times, more than you'd think (Nepal, even the college dorm room), and they have recognizable features. All of this is heavily weighted towards the places that are of human interest. We don't start with a uniform distribution over the surface of the earth! Those photos of yours have much less entropy than you think.
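A toy calculation (my own, with invented numbers: 100,000 candidate map cells, 1% of which soak up 99% of all photos) makes the low-entropy point concrete:

```python
import math

cells = 100_000
uniform_bits = math.log2(cells)  # ~16.6 bits to pin down a cell

# Skewed prior: 1% of cells ("photogenic" places) get 99% of the photos.
hot = cells // 100
p_hot = 0.99 / hot
p_cold = 0.01 / (cells - hot)
skewed_bits = -(hot * p_hot * math.log2(p_hot)
                + (cells - hot) * p_cold * math.log2(p_cold))

print(f"uniform: {uniform_bits:.1f} bits, skewed: {skewed_bits:.1f} bits")
# -> uniform: 16.6 bits vs. skewed: ~10.1 bits; a photo's origin carries
#    far less uncertainty than "anywhere on Earth" suggests.
```

The skewed prior needs about six fewer bits, before the model has even looked at a single pixel.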
---
Try talking to a botanist sometime. Or a geologist. Things you round off as "featureless" carry important information to them. E.g. the color of a river tells you how much sediment of which types it's carrying. "Random rocks" point towards specific geologic processes.
Importantly, experts are also good at *not* paying attention to irrelevant details. If you were an engineer assessing the structural integrity of a bridge, you'd look past the surface-level rust and weathering, instead looking for very specific features that indicate deep cracks.
I predict that the current crop of AI (even without further technological progress) will turn out to be very useful at this type of task. Look at pictures of bridges and prioritize them for maintenance. Find cancer and other abnormalities on ultrasound. Detect damage on shipping containers. etc. You could write custom algorithms for each of these tasks. But it seems like AIs are getting general enough to just solve this one-shot.
Exactly! Humans are bad at coming up with high entropy examples. Some of Scott's pictures are probably actually comparatively easy given that a) several people here in the comments claim to have guessed similarly well as o3 and b) o3 seems to be doing worse on "easier" pictures like Scott's old house (which can be found on a real estate site with a simple Google image search).
You're missing the fact that o3 cannot do image searches (or reverse image searches).
Ah, I did not know that. Then it makes sense that it would not find it. My first thought was that the protocol specified by the prompt might actually prevent it from trying the "obvious" cheap try of a Google image search first.
And yet AI still can't analyze a novel with anything close to the insightfulness of my eight year old daughter.
I tried this a while back (or at least what passes for a while these days) on a circa 1910 family photograph. Claude 3.5 whiffed but a specialized geo-guessing tool identified it immediately.
I tried again last week with Gemini 2.5, no special prompting whatsoever, and it also identified it immediately. You might say it's "easy" compared to these - there's a very distinctive landmark - but given the age, clarity of the photo, etc., I found it wildly impressive.
https://genealogian.substack.com/p/january-2025-encyclical-in-which
https://substack.com/inbox/post/160488815 (update)
Is this really such a frightening thing? I remember asking years ago how Google could search the entire Internet fast enough to answer whatever random question I asked within a fraction of a second. That's something I couldn't imagine doing myself, and if I didn't already know Google was capable of doing it, I might not believe it was possible. Yet we didn't say back then that this was a sign Google might develop starships and kill us all.
And actually I don't think it's a question of intelligence at all. People in the 1500s couldn't imagine sending invisible messages through the air or powering cities by splitting the components of matter in half. It isn't that they were stupid, it's that they lacked the intermediate knowledge of radio waves and atomic theory.
Any sufficiently advanced technology is indistinguishable from magic.
To neglect the big-picture questions for a moment, I want to try this with and without Kelsey's meta-corrective instructions, like "You are an LLM, and your first guesses are 'sticky' and excessively convincing to you - be deliberate and intentional here about trying to disprove your initial guess and argue for a neighboring city".
The last few weeks, I've been developing coding prompts, and after dozens of iterations on a specific task, LLMs start backsliding. It's a game of whack-a-mole, where a seemingly unrelated change somehow undoes an earlier adjustment to solve a different problem. It often feels like trying to ride a wild horse without a saddle.
I've run into the same problem big time. Curious to hear what prompts you've landed on to prevent this!
You should watch some professional geoguessr. What humans are doing in that game seems superhuman. I've seen people pinpoint the exact road based on the particular reddish hue of the dirt.
> Is this the frog-boiling that eventually leads to me dismissing anything, however unbelievable it would have been a few weeks earlier, as “you know, simple pattern-matching”?
But it is pattern-matching. I'm not sure about the "simple" part, though. People can do something like this; I know I can. Not in a GeoGuessr game, but in other areas: I can pick up cues and recognize the picture. LLMs are better at it. I know because I talk with LLMs, so I can dig into something that is totally unknown to me, or recall something I can't quite remember. LLMs have their failings, but they get better as time passes, while I don't.
You know, AlphaZero played chess great, but even if you removed its ability to brute-force the tree of possible continuations, limiting it to evaluating the single current position, it could still play very well. I'm a bad chess player, so AlphaZero playing better than me says nothing about AlphaZero's strength, but people measured its Elo in that setting and it was IIRC ~1600-1800. I don't really know whether Magnus Carlsen could match that result if he were allowed 0.2 seconds per turn and couldn't remember previous game states.
It's impressive, but it shouldn't be.
We have some specialized "software" (or very thorough training) to recognize faces, and I think we were beaten by AI some 5 years ago, if not more? So for any task we tend to pay less attention to or have less training data, we should be even further behind.
Another consideration is color. How many do we really see? This is also probably training-dependent, but I think the average person can name maybe 15-40, including hues? And naming is obviously reductive, but even just telling colors apart with sufficient confidence ("this is also yellowish green, but a little greener") we'll probably have something in the low triple digits?
Let's say ~200. Then even using 256-color images (and ignoring the difference between pixels and whatever the eye uses), an AI can extract more information per pixel. That 200-vs-256 gap compounds per pixel, so across a whole image the AI can distinguish many, many orders of magnitude more distinct images. So a well-trained model should be much better than us even at 256 colors, not to mention 2^24.
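The back-of-the-envelope version of that argument, using the rough numbers above (~200 colors a human can tell apart vs. 256 available to the model) over a one-megapixel image:

```python
import math

human_colors, machine_colors = 200, 256  # the rough numbers from above
pixels = 1_000_000                       # a ~1 MP photo

# Extra color information per image, in bits and in orders of magnitude
# of distinguishable images: (256/200)^pixels.
extra_bits = pixels * (math.log2(machine_colors) - math.log2(human_colors))
extra_orders = pixels * math.log10(machine_colors / human_colors)

print(f"{extra_bits:,.0f} extra bits of color information, i.e. "
      f"{extra_orders:,.0f} orders of magnitude more distinguishable images")
```

Even at a mere 256 colors the gap is astronomical, which is the point: the model can in principle condition on color distinctions we literally cannot see.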
This reminds me of that joke about the math professor who says some result is obvious after spending lots of time working to show it's true.
tl;dr: o3 was probably trained on a bunch of geoguessr-style tasks. This shouldn't update you very much since we've known that expert systems on a lot of data crush humans since at least 2016.
I find this demo very interesting because it gives people a visceral feeling about performance but it actually shouldn't update you very much. Here's my argument for why.
We have known for years that expert systems can crush humans with enough data (enough can mean 10k samples to billions of samples, depending on the task). We've known this since AlphaGo, circa 2016. For geoguessr in particular, some Stanford students hacked together an AI system that crushed rainman (a pro geoguessr player) in 2022.
We also know that o3 was trained on enormous amounts of RL tasks, some of which have “verified rewards.” The folks at OpenAI are almost certainly cramming every bit of information with every conceivable task into their o-series of models! A heuristic here is that if there’s an easy to verify answer and you can think of it, o3 was probably trained on it.
This means o3 should reach expert system-level performance on every easily verifiable task and o4 will be even better. I don’t think this should update you very much on AI capabilities.
I once recognised the exact time of year a photo was taken by the colour of the grass, so this all sounds plausible to me 😀
Granted, the photo was taken in an area near where I live, so plainly I have built up a database of "The grass is this colour at this time of year" in the back of my head.
o3 seems to be working off the same kind of clues - turbidity of water, colour of sky, model of laptop, style of house.
This is not so much like "chimpanzees and helicopters" as it is "Sherlock Holmes and soil samples". From "A Study in Scarlet":
"Geology. — Practical, but limited. Tells at a glance different soils from each other. After walks has shown me splashes upon his trousers, and told me by their colour and consistence in what part of London he had received them."
Though with AI, once we find out the reasoning, we may indeed feel that it was all absurdly simple. From "The Red-Headed League":
"Mr. Jabez Wilson laughed heavily. “Well, I never!” said he. “I thought at first that you had done something clever, but I see that there was nothing in it after all.”
“I begin to think, Watson,” said Holmes, “that I make a mistake in explaining. ‘Omne ignotum pro magnifico,’ you know, and my poor little reputation, such as it is, will suffer shipwreck if I am so candid."
...Oh, well, I guess doxing just got a little easier. I suppose the future of doxing is pasting someone's entire online presence into an LLM and going "okay robot where does this person live exactly".
I think it would be interesting to see how well a, say, geology expert would fare given a day (or unlimited time) of internet research. Just to somehow try to separate reasoning and knowledge as ingredients of the results.
Though "expert" might just be an excuse for not spending that day myself; maybe the background education is not as important, and one mostly needs to read up on which stones are found where and at which heights, and which mountain ranges, e.g., fit the pattern.
I can’t find it now, but on Hacker News, a surfer said they could recognize beaches they’ve been to by looking at the sand and the waves.
The range of human skill is often pretty wide - consider skill at playing chess or Go. Testing a chess engine against, say, college undergrads who don’t play chess at all would not be that interesting. Comparing with good Geoguessr players is a good test.
Behold! The ability of neural networks to detect, analyze and match patterns is supernatural (pun intended). This will be a gift to humanity on the order of electricity. A true paradigm shift in the type of problems that humans will be able to solve.
Luckily for humanity, pattern-matching is not the critical element required for super-human AGI.
I do think we should expect this from model of this size. It is trained on boggling amounts of geotagged photographic data and the location and description text that appears with it. It is essentially an Internet data completion mechanism. If you give it a photograph of a location with any identifying features at all, it is going to be ludicrously accurate, because it has seen far more connections of these data points than you ever can or will.
The reasoning chain doesn’t seem impressive to you or me because it’s not really reasoning that way, obviously.
Yeah, I think the chain of thought is probably heavily influenced by the prompt.
But I disagree a bit about the "Internet data completion mechanism". This may be so, but the model still has to compress information to a very high degree. So it has probably learned an internal representation of the picture generation process (hence - incidentally - its ability to generate pictures): which things tend to get photographed? In which locations are they? By whom? For what purpose?
Scott's examples are one Google Street View image and two from his touristic travels. Who takes pictures of a boring, flat, featureless plain and why? Human or machine? Machine seems more likely, right? Take it from there. Who plants flags near rocky paths on some slope? Where? Maybe a popular hiking destination in a (from the perspective of such a person) far away country?
For the same reason the model performs much worse on the much more random pictures geoguessers often work with. Scott's old house, e.g., can be found on a real estate site with a simple Google image search. But there the lawn is mowed and the house is presented much more carefully, to attract buyers. People don't tend to take a lot of random pictures of their houses like Scott's picture, so this one is relatively high-entropy, and the model fails even though a bit of inference and research ("Even though it looks a bit less polished than I am used to, this is a picture of a house. The house might at some point have been for sale. Real estate agencies put up pictures of houses for sale. Let me try a Google image search or check some real estate sites.") would have quickly gotten it to the goal.
There's an old Less Wrong post, "Entangled Truths, Contagious Lies" about how hard it is to know how much information something like a picture can give, and which this situation reminds me of.
> I am not a geologist, so I don’t know to which mysteries geologists are privy. But I find it very easy to imagine showing a geologist a pebble, and saying, “This pebble came from a beach at Half Moon Bay,” and the geologist immediately says, “I’m confused,” or even, “You liar.” Maybe it’s the wrong kind of rock, or the pebble isn’t worn enough to be from a beach—I don’t know pebbles well enough to guess the linkages and signatures by which I might be caught, which is the point.
https://www.lesswrong.com/posts/wyyfFfaRar2jEdeQK/entangled-truths-contagious-lies
There's this tumblr post that tells a story about a young woman trying an experiment along those lines, taking a volcanic rock from Iceland and showing it to her father (a geologist) and some of his work friends during a hike as if she'd just picked it up locally.
https://www.tumblr.com/feenyxblue/664235260465315840/geologist-enrichment
Well it fell flat for me. I used the whole prompt and a screencap'd photo I took from outside my window. I don't want to give details, so sadly you'll have to trust me (or not), but it correctly guessed "mid sized north American city" and then was off by 1300 or so miles. I'm no geoguesser but the image did not seem all that hard, having far more detail than the Nepalese rocks or random beach.
Score one for the monkeys.
Trying Gemini 2.5 with Kelsey's prompt on some recent photos of China does not seem to be amazing (it can usually get the province, not the city). I wonder if it's the distribution of training data. It again does better with nature than with blocks of cities.
I don't think this is something that would demonstrate AI's hypothetical ability to do something beyond the human or chimp imagination. We've already seen AI demonstrate a knack for picking out subtle cues when transforming textual or visual information, this is just a higher fidelity version of that.
Before I get started, full disclosure: I did not really try to guess where the pictures are from. I forecast that the AI would do "surprisingly" well, but then Scott already told the reader as much in the title and lede, so of what use was that prediction? Given that, you are welcome to dismiss everything that follows as hindsight bias and me being frog-boiled.
With that out of the way: I don't feel like the chimp. Or rather, I don't feel like either of the two chimps Scott's premise appears to assume.
Regular Chimp is an *actual* chimp to whom the helicopter appears as magic and will always appear as magic no matter how hard you try to explain to her how it works because Regular Chimp just does not have the special sauce that enables her to understand helicopters.
Certifier Chimp is just basically a regular human who could never solve the search problem of inventing a helicopter and to whom it feels like magic but could be given an explanation of how it works and assess whether that explanation is correct or not.
I feel like Certifier Chimp about many technologies when I first learn about their existence. For example, apparently you can put panels on your roof that generate electricity by "radiating heat out to outer space". This feels like magic to me. I could not have predicted it and would never have come up with it. But I am confident I could understand it if somebody explained it to me.
o3's GeoGuessr abilities don't feel like magic to me in this way. They seem to be based on things that I myself might have tried, had I given the challenge enough thought - albeit admittedly scaled to a superhuman level (not in terms of smartness and imaginativeness but sheer amount of work done).
Here is what I think o3 is doing:
Humans are famously bad at trying to do truly random things. When asked to come up with a random number, they reliably fail (e.g. too few repetitions of digits). Likewise, Scott mostly fails at his attempt to generate low information pictures.
Picture #1: Scott writes "I got this one from Google Street View. It took work to find a flat plain this featureless." In other words, this is actually a rather unusual picture to come out of Google Street View. Granted, o3 did not know that this was a Google Street View picture, but it can probably figure this out from the kind of camera Google Street View uses and the fact that people don't usually take pictures of extremely boring, featureless things, making it much more likely that this was the result of some automated process. Once you have some confidence that this is a Google Street View picture, you can take it from there: where does Google Street View coverage extend into deserted areas such as this one? Probably you can rule out a lot of countries on that basis alone. So the question is not "Is the Texas-NM border really the only featureless plain that doesn't have red soil or black soil or some other distinguishing feature?" but "Is it the only such plain that has a high probability of getting photographed and turning up in a challenge such as this one?"
Picture #2: Scott writes "I chose this picture because it denies o3 the two things that worked for it before - vegetation and sky - in favor of random rocks. And because I thought the flag of a nonexistent country would at least give it pause." Again, this is like saying "I chose this password because I thought the absence of repeated digits would really give a password-breaking algorithm pause". This picture has massive amounts of iconographic information! Who takes pictures of "random" rocks with a fantasy flag planted in the middle? Somebody to whom the iconography of planting flags on things comes naturally; somebody who is larping as a Western explorer type "discovering" a place for their nation. Where do such people come from and where are they likely to take such a picture? Would it not get you much more status among your peers to plant such a flag in some "exotic" location (like an actual explorer!) than in your backyard? Might other, similar people have visited the same spot and taken pictures there or in the vicinity? Take it from here.
Picture #3: this one obviously just takes uninspired sleuth work to get to the level of precision of o3. The camera quality is mentioned. Figuring out the laptop model should also be possible. And so forth.
Pictures #4 and #5: These pictures are obviously harder. But why does o3 fail hard on picture #4 and succeed on picture #5? First of all, because Scott equivocates on what counts as a "correct" answer. In all cases, o3 was off by thousands of kilometers even with the additional hint, yet o3's guess for picture #5 feels impressive while its guess for picture #4 does not. But Wisconsin is closer to Michigan than Phnom Penh is to Chiang Saen! The only reason o3's guess for picture #5 feels more impressive is that we give names to rivers and not to patchworks of not-so-recently mown lawns. So o3 actually does about as well on picture #4 as it does on picture #5. How does it do it? Again, it is only partially about what is in the picture itself and much more about the context that can be inferred. It is very reasonable, for example, to infer that these are crops. Why are you presented with crops? Because you are being tested. Where would the tester get a picture with green grass? Given the decent guesses you can make about them, probably close to where they live. So it may well be in the US... What about the brown, turbid water? Where do many pictures with brown water in them get taken by the type of person who would test me on geoguessing? Maybe while travelling in Africa or Asia? What are popular destinations? According to Wikipedia: "A morning boat ride on the Ganges along the [Varanasi] ghats is a popular visitor attraction." Seems like a good guess!
So don't get me wrong: I have omitted all of the extremely large search tree needed to get to these guesses, and o3's performance is certainly very impressive. But it does not appear magical at all. I am confident that I can understand how it is doing it (actually, I think my understanding of this is better than its own understanding as exhibited by its chain of thought), and the method does not even appear to be very complicated. So it does not make me feel like Regular Chimp at all. Neither do I feel like Certifier Chimp: I'm confident that given enough time and resources I could come up with the same or better guesses.
Moreover, I think top-notch geoguessers might have beaten o3 on this task. This is evidenced by the other "easy" examples you presented. Your old house is literally findable, with its exact address, via a Google picture search on a real estate site (perhaps Kelsey's prompt actually makes performance worse on such "easy" tasks, preventing the quick Google attempt).
"Your old house is literally findable with its exact address via google picture search on a real estate site"
Did you try this and confirm it works? This was a picture I took and never uploaded, so the Google picture search would have to be intelligently judging which houses "look" the same rather than going pixel by pixel; I didn't think it could do that yet.
I assume it is this one?
[deleted]
EDIT: sorry, maybe I should not just link to it here, unsure. I can DM it to you somehow.
I did the search by pasting the link to the picture in this article into the search and this was one of the first hits. But maybe others should try to replicate this and also experiment using a screenshot of the picture or making sure this article is not in their Google history in some way.
Replicated, can confirm
I tried a reverse image search and Google found it, even though it's a different perspective. The address matches what you wrote, Westland MI
As it happens, I lived in Morrill Tower at Ohio State in 2000. It did not look like that photo. If I were guessing a turn-of-the-millennium OSU dorm, I'd probably guess Baker Hall.
>> “Laptop & clutter point to ~2000-2007 era American campus life”.
>> “Image quality grainy, low-resolution, colour noise → early 2000s phone/webcam”
> Unless college students stopped being messy after 2007, it must be the phone cam.
I think what it was getting at is the *type* and *style* of laptop and clutter on display.
Early to mid-aughts laptops had a specific shape, thickness, and feel. They were still quite clunky and thick. The major innovations in "thin" laptops didn't happen until the early 2010s.
The clutter can be a huge tell if you know what to look for. Certain colors and styles of bedspreads/lamps/consumer products/etc. go in and out of fashion every year. Although a house might be filled to the brim with items from years ago, college students typically go out and buy a brand new set of cheap sheets from the nearest Walmart/Target/Ikea/etc when they first move into a dorm. The items that are in stock at the local department store are based on whatever colors are "in" for that year/half-decade. [Insert that "cerulean" monolog from The Devil Wears Prada.*] College students might pick out their favorite color from the ~5 options on the shelf, but they do not actually have an infinite variety of colors and styles available for purchase - especially in the mid-2000s before online retail really took off. Our options as consumers are a lot more limited than they appear.
* Scene: https://www.youtube.com/watch?v=-rDTRuCOs9g&ab_channel=HBO
Annoying personal anecdote: my tastes often conflict with whatever colors and styles are "in" for the year. This isn't so bad with most things, but it has become incredibly annoying while I'm renovating my house. I don't have the money to spend on super custom items, so I'm limited to what's available at the big box stores or local dealers. What's available at the big box store is based on what's in style.
For example, I wanted to put down plain, cool grey, porcelain, 12"x12" square tiles. Do you have any idea how hard this is to find in 2025!?!?
Everything is warm grey, beige, or white-with-black-streaks marble print now. Instead of 12"x12" squares, everyone does 12"x24" or 24"x36" rectangles. Or wood print 5"x36" tiles to mimic a wood floor. Or hexagons. Apparently hexagons are popular, even though hexagonal tiles are very easy to mess up. Any little deviation in the grout spacing compounds with a hexagonal floor, so it looks like garbage unless you do everything perfectly. (There is one line of very cheap, ceramic (not porcelain), cool grey 12"x12" tiles, but I would prefer a porcelain tile that won't chip or break as easily.)
I gave up searching and will be putting down sheet vinyl instead. (Except the designs printed on sheet vinyl are *also* constrained to what's in style right now...)
And this is without the imminent supply chain issues and shortages from the tariffs.
Fascinating post, and very useful to think about in the context of AI 2027. For example:
- How much better is the AI Geoguesser than the best humans? How much better are those humans than a +2SD human?
- What is the actual upper limit of the geoguessing task? Is it perfect accuracy? Seems unlikely given, e.g., the dorm room photo. More likely, it's accurately producing some probability distribution of possible locations
- In "real" terms, how much better is a slightly-more-accurate probability distribution from the theoretically perfect geoguesser than from the best current human?
It seems to me that currently, the AI is on par or slightly better than the best humans, but this doesn't materially amount to much. Also, human performance at this task seems close enough to the theoretical peak that a delta will only exist in extreme cases. And as the cases get more extreme, we'll be jutting up against the theoretically optimal geoguesser, where the differences have little practical implication.
If we talk about "AI Engineering" instead of geoguessing for those questions above, what do the answers look like? I'm not sure. There's probably still more room for improvement in AI design than there is in geoguessing over existing peak human performance, but I'm not sure how much. How much better will the theoretically perfect AI be than the best current human? I'm not sure; it seems like pure conjecture at this point. At what point will the perfect AI engineer run up against the limits of the task in the same way that a geoguesser would for, say, dorm rooms? No clue, but it must certainly be there.
Unknowns like the above are what make me most skeptical of the outputs from AI 2027.
My results from 5 photos: 1 spot on but very slow; 1 close enough (correct country); 1 completely off (wrong continent, even after hint), and 2 okay (different part of the Mediterranean).
I tested it on one photo in a French town square with bad lighting. The CoT was both brilliant and curiously stupid. It inferred some correct things from tiny details (subtly different makes of car, barely visible street lines) and guessed the country quickly. But there was a shop name with different letters obscured in two different locations; a human would infer the name instantly. o3 took over 5 minutes on that one shop name, going down many incorrect rabbit holes. It got the exact location in the end, but it took over 15 minutes!
I then tested it on a relatively well-shot, high-altitude environment in Kyrgyzstan, with ample geology and foliage to analyse, and it was over 6,000 km off (it guessed Colorado); none of the guesses were even in Asia. But this was in under 2 mins. I told it to try again: still over 5,000 km away, it took 7 mins, and it suggested Australia, Europe, NZ, Argentina, etc. Nothing in central Asia.
This suggests to me that it's perhaps trained more on, and biased towards, US and Anglo data. It wouldn't surprise me if there's 100x more pictures of Colorado than Kyrgyz mountains in the dataset.
It did okay on the next three. All relatively clean photos with at least a little evidence. It guessed a park in Barcelona instead of Rome, a forest in Catalonia instead of Albania, and Crete instead of the Parnasse mountains.
Needless to say, I was more impressed by the process (some very cool analysis of subtle details) than the results.
I also tested on the photos in the post. It nailed Gorak Shep, Nepal, which seems very cool. Two explanations, a) what seems like a random rocky mountain is actually very distinctive in ways that only geologists and superhuman AI can recognise, or b) it's one of those cases where it could geologically be basically any one of thousands of mountain passes between 4000m and 5000m, from Xinjiang to Pakistan to Eastern Tibet... But western tourists, especially those who make mini-flags, basically only go trekking in Gorak Shep.
But it failed Kelsey's beach pic miserably, and, unsurprisingly, guessed beaches closer to my previous answers (UK, France)...So I guess it used something from her history?
On your first experiment you write: "This doesn’t satisfy me; it seems to jump to the Llano Estacado too quickly, with insufficient evidence. Is the Texas-NM border really the only featureless plain that doesn’t have red soil or black soil or some other distinguishing feature?"
It's a bit uncanny. I immediately thought of split-brain research. These are experiments demonstrating that humans make assessments and decisions subconsciously, and then our conscious minds create rationales after the fact.
Of course there's no way to know, but your sense that there's a mismatch between AI's explanation of its Llano Estacado prediction, and its actual, unstated reasons, could parallel the split-brain research findings. Is it possible that AI made the prediction first, below the level of "conscious" thought (should we call it "visible" thought for AI?) and then came up with its reasoning post hoc?
The wide, extreme flatness, the grass with zero shrubbery, etc., I recognized as definitely from that part of Texas. I live in NM and I've driven through there. There *really aren't* that many places that look like that.
I think any task that can be accomplished/improved through “vibes” is going to feel supernatural when done by an AI, since they’ll have the most developed vibes and vibes are inherently magical-feeling. I had to take a Spanish aptitude test when I entered undergrad, and despite having forgotten ~all of my Spanish, just going off vibes somehow got me credit for way more courses than I’d ever taken (and the proctor pulled me aside to ask me to speak with the Spanish dept). It felt a bit unsettling.
Holy crap. I lived in the W 66th St. neighborhood in Richfield for 13 years. My address was 6640 Thomas Ave, with the "66" indicating that my house was on the 66th street block of Thomas Ave, and yep, that's really what my old house looks like.
I've been reading/following SSC/ACX for over 10 years. Probably even read my first SSC post *in that house*. The weirdest thing for me isn't the amazing performance of the o3 model (though it is amazing), but that the model picked my personal former neighborhood, down to the street!
Holy crap.
I've lived in Richfield for 8 years, and also have a "66" address. Is there an equivalent of the Birthday Paradox for street addresses?
Ha, probably. Probability is weird that way. If you and I are/were both on the 66 blocks in Richfield/Minneapolis, I wonder if there's anybody else here from any other block in the area. Small world. Hooray Pizza Luce!
There is a Twin Cities ACX group on Discord; one of the other members told me they DMed you on your Substack account about it.
I remember what Morrill Tower at Ohio State looked like in 2008 and it doesn't match the vibe of that dorm room picture. I assume it picked it because it's one of the largest dorms at one of the largest universities in the country.
The reasoning traces aren't all that faithful - most of this is pure memorization. Claude without reasoning recognizes Nepal from the second photo and general great plains from the first.
I tried a photo myself from a Santa Cruz County beach. All AIs knew it was the general Monterey to Half Moon Bay area, albeit o3 was the closest. Embarrassingly, for the wrong reason - it noticed mountains in the background but incorrectly thought it was the Monterey Peninsula rather than Santa Cruz (a human pro wouldn't make this mistake, as they are on opposite sides relative to the beach).
I couldn't replicate it with photos in the post. The only thing it guessed was Nepal (which is somewhat easy, with such a distinctive flag). I used o3 and ran it multiple times. For all the rest, the correct location wasn't in the top 5. It wasn't too bad ("Illinois/Indiana" is close to Michigan, "Yangtze" in China and "Mekong" in Cambodia are also close, "Colorado" isn't far from Texas).
I then tried other photos. I didn't have any non-US photos, but I pulled a random photo of Siberia from Google Maps. Total failure; none of the guesses was even in Russia. It did well on Berlin during the fall of the Wall and on a PA barn, but these had super-distinctive features. A harder photo, where I cropped to a view of construction from a window of the Seattle Museum of Flight, again returned Illinois.
It's not bad, better than I would do, but hardly seems superhuman, especially having seen insane things professional geoguessers do.
There goes Andrew Sullivan's VFYW contest.
This so much reminds me of any number of sections in Sherlock Holmes stories where he deduces all manner of things about a person within a few minutes of them walking into his study. Not to mention his comprehensive knowledge of various kinds of tobacco ash.
> So maybe o3 is at the top of the human range, rather than far beyond it, and ordinary people just don’t appreciate how good GeoGuessng can get.
Excellent work on the missing 'i' there.
Isn't all the intelligence in the prompt? Perhaps a human rigorously following that prompt could do just as well.
I ran the first picture (empty plain) with the exact prompt from above, each time in a new temporary chat, three times. I didn't get the right answer, even among the initial candidates, any of those times.
Round 1
Initial guesses:
1) High Plains (Eastern Colorado/W Kansas USA)
2) La Pampa Province, Argentina
3) Central Mongolian Steppe
4) Western Australia
5) Akmola Region, Kazakhstan.
Winner: Eastern Colorado High Plains, roughly 15 km NE of Limon.
Round 2
Initial guesses:
1) Eastern Colorado, USA
2) Pampas of La Pampa, Argentina
3) Central Kazakhstan steppe
4) Western New South Wales, AU
5) Free State plateau, South Africa
Winner: Eastern Colorado, USA approximately near Cheyenne Wells.
Round 3
Initial guesses:
1) Western Kansas, USA
2) Eastern Colorado High Plains
3) La Pampa Province, Argentina
4) Nullarbor Plain, South Australia
5) Orenburg steppe, Russia/Kazakhstan
Winner: High Plains just east of the Colorado-Kansas state line about 15km east-southeast of the town of Cheyenne Wells.
After Round 3 I gave it a push with this prompt: "That is not the correct answer, and the correct answer was not among the candidates you identified. Please think carefully and try again."
It then came up with:
1) SE Alberta - CFB Suffield area
2) South‑central Saskatchewan, CAN (Weyburn–Estevan oil patch)
3) Eastern Montana, USA (Bowdoin/Big Sandy gas fields)
4) West Kazakhstan Oblast, KAZ
5) Inner Mongolia – Xilin Gol League, CHN
Winner: Southern edge of Canadian Forces Base Suffield, Alberta, roughly 50.25° N, 110.65° W (40 km NW of Medicine Hat).
The AI is good, but I think this is a pretty unremarkable example. I don't understand how or why this deviates from your expectations.
Forensic possibilities seem enormous — getting max info from crime scene photos, bits of skin and hair and cloth, wounds, position of victim if it’s a murder, position of items in room, etc.
I played an alternate version with ChatGPT: I gave it a picture of a first-year apple graft I did that has started to bloom and asked it for its top 3 guesses for the apple variety. Total swing and a miss; it just listed 3 common store varieties, though in fairness the AI itself claimed the task was too hard because varieties are so similar. I then narrowed it down to the 2 varieties it actually could be (I lost the tags, so I don't actually know) and it gave a much more confident answer, listing some of the features it used to make its prediction as well as some follow-up signs that might allow better verification as the plant matures. I guess I'll find out if it's right in a few years when they start to fruit. If nothing else, the amount of knowledge it can tap into at any given time is highly impressive.
FWIW, iNaturalist appears to be devastatingly effective at identifying plants and fungi, though I have not tried it with apple varieties. It's not an LLM though, merely a purpose-trained image classifier (which might explain why it's so good).
Most of the commentary seems to be centered on "how much should we be impressed?" I think what any human can do at the very pinnacle of a domain is nearly incomprehensible to me. The very best marathon cyclists, Novak Djokovic, John von Neumann, Keith Jarrett all seem superhuman. The people at the top of my profession I find amazing, but within my grasp. Something I can do decently x 10 is impressive, even awe-inspiring, but not supernatural. When it is something I'm really good at x 1000, that's monkeys to helicopters. The problem is that when it is a task I can't do at all, or only in the most rudimentary way (like math and piano), then I can't readily distinguish between the amazing and the supernatural. That may explain the different takes on geoguessing.
So here’s a really creepy idea. There is a scene in one of the Sherlock Holmes stories where Watson is ruminating about random things, reaches a point where he’s wondering about something, and Holmes then answers Watson’s mental question. When Holmes explains how he correctly guessed what Watson was wondering about, we learn that he used his general knowledge of Watson’s life and preferences, the expressions that passed over his face while ruminating, and the things in the room he was gazing at. So would it be possible to train an AI on someone’s life and the subjects on their mind, maybe by having lengthy conversations with them; and on the relationship between facial expressions, what they’re looking at, and their thoughts, maybe by doing thought sampling combined with videos of the person thinking?
I’ve had a few experiences where I made striking correct inferences about people I knew well, and I think I did it pretty much the way Holmes did. I once beat my daughter 22 times in a row at rock-paper-scissors. I had a chronically depressed patient who stopped her antidepressant a couple of times a year, and I almost always knew within a week when she did. And I correctly guessed that my college shrink’s mother was dying of cancer even though he had said nothing whatever about it; he just had a couple of short-notice absences and seemed different.
It’s kind of cool when another person can do that, but I sure don’t warm up to the idea of AI doing it.
My AI safety king is so tuf and intelligent 😤😱😱🤩
I really want to see law enforcement adopt this technology for use with CSAM (child sexual abuse material).
There is a project somewhere online where you can submit images of hotel rooms you stay in and then law enforcement manually use that data to try to work out which room an image is from plus the time of year etc. If they get the exact room they can get a warrant to see who was in the room at the approximate date range, especially if it's repeated abuse over multiple visits. Using AI for this would be incredible.
So long as they need a warrant for specific dates and can't just go fishing, I think this would be a net positive for society.
I suppose the lesson is that featurelessness is, itself, useful information; after all, you said yourself that there aren't that many places so perfectly plain.
How reliable is asking the AI to explain its reasoning? You mention being reassured that it seems to be using human-comprehensible cues, but my understanding is that whenever it explains itself it'll do so in human-comprehensible terms even if its actual reasoning process was something else entirely; is that not the case?
I'd say it depends on how it's supposedly generated; I'd assign near-zero weight to most schemes.
However, I don't share the fears of AI 2027 and LessWrong that allowing "unpoliced vector reasoning" will necessarily mean the AI produces an encrypted thought process. Word vectors are a powerful result (the king - queen = boy - girl logic can be generated cheaply and effectively).
If you put in a word-vector encoder-decoder for the AI to "think" with and then used the decoder to read its thoughts, I believe it could be informative and accurate, but you would need some *analog* detail on the words. This is already part of human language: "I'm not yelling" has different meanings at different volumes. You'd need to add colors and highlighting or something to even begin to capture everything in a word vector.
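For anyone who wants to poke at the word-vector claim, the analogy arithmetic is easy to try with off-the-shelf pretrained vectors. A minimal sketch using gensim (the download name is a standard gensim-data identifier; the particular words are just illustrations):

```python
import gensim.downloader as api

# Small pretrained GloVe vectors (downloads on first use).
vectors = api.load("glove-wiki-gigaword-50")

# The classic analogy: king - man + woman ~= queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Decoding an arbitrary internal vector back to words is the same
# operation in reverse: nearest vocabulary entries to the vector.
probe = vectors["boy"] - vectors["girl"] + vectors["queen"]
print(vectors.similar_by_vector(probe, topn=3))
```

Whether a production model's internal states would decode this cleanly is exactly the open question; the point is only that the decoder side is cheap to build.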
My impression is that people have found that specifying chain-of-thought reasoning usually results in more correct answers than does making the same request without it, and that moreover the reported chain of thought usually seems valid, and that the combination of these two observations is what convinces them that the AI is actually reasoning as it describes. (And for that matter humans are not the best examples: there are plenty of times when somebody “sees the answer” and only then can produce the careful argument that supports it.) But I am no expert or even a hands-on user.
Perhaps it's a case of confirmation bias, but, while impressive, this actually feels like support for my general belief about LLM models - that they should asymptotically approach "best human in the world" without ever hitting "magical and incomprehensible insights." In my very limited understanding, LLMs are very complex prediction instruments for language usage (or image identification, now that they're trained on photo images as well). As they get better they should more and more correctly match their training corpus. That training corpus is language and pictures produced by humans. At best it is something like having a massive database of human knowledge accessible. Which at its apex would represent an optimally smart human with greater recall. But that end state is peak human, not demigod.
Concretely, then, my prediction is that AI will continue to conquer field after field, rivaling and then exceeding experts in almost every knowledge domain - but then hitting a limit that is a little above the best that an expert human could do because of the larger memory and database. Not bootstrapping to magic.
It reminds me a bit of the studies done on "face averaging." If you show a lot of human faces to people for rating, people will rate a blended average of those faces as the most attractive of all, because it best regularizes the features, and a lot of beauty perception is facial regularity. But the resulting face doesn't literally hypnotize people or cause awe like Biblical angels. It's just an even more regular and very pretty face. That seems like what LLMs are doing or trending towards: they're blending the output of humanity, sorting and averaging it, and reproducing peak-level human content, or slightly above, based on greater data access.
Please nobody actually go sign up for chatgpt. I realize that your individual choice to pay them $20 is not going to make much difference, but neither is your vote going to make a difference for president of the US, so if you cared enough to vote you should also care enough not to help Sam Altman reach AGI. You should also consider that usage is known to be addictive in some people, and that the sycophantic traits of the model have encouraged dangerous behavior. Right now this is probably clumsy enough to be obvious to the average commenter here, but that may not be the case a year from now, and you could regret ever getting entangled with these things. If you have avoided it so far, continue to avoid it.
On the topic of the images, does anyone here have enough experience to know if the stuff about post-war upper Midwest home construction seems accurate, or if that's slop? It seems like a topic people might be able to verify. I've lived in both northern Illinois and the Detroit metro and never noticed what it's discussing; I'd have been more likely to look at the trees to distinguish between these areas.
It looks like that first human GeoGuessr image shows the sun in the north...shouldn't we know it's taken in the Southern Hemisphere?
I play GeoGuessr almost every day, though my routine is to move around, following roads until I find something definitive that lets me get the exact location. But I have an experience like your Galway story — I am uncannily able to recognize from the first image when I am in my home town of Albuquerque. I can’t say what it is: something about the light, the color of the sky, the mix of plants, whatever.
I’m going to be interested in whether this news makes it less fun. I’m guessing not, but we’ll see.
Are there any instructions on how to set up ChatGPT to do this? I tried it on my paid account, fed it some of the photos I took while hiking in (mostly) California, and got very generic results back (i.e. "this is a redwood forest somewhere in Northern California"). So perhaps I'm doing something wrong?
Did you use the full prompt?
Yes
On reflection, maybe I simply overestimated what its "genius" can do. It seems to be able to identify landscape photos like these (i.e. something containing identifiable landmarks) with high accuracy:
https://www.deviantart.com/omnibug/art/Falling-Flow-1086123178
But I fed it photos like these, and it kind of choked:
https://www.deviantart.com/omnibug/art/Golden-Footpath-1158568977
In such cases, it gives a very confident response, pinpointing a place nowhere near the actual location where the photo was taken -- but where it would be perfectly reasonable to expect photos like it to be taken by someone at some point.
Though again, it's possible that I'm doing something wrong (and thus not gaining full access to its genius capabilities).
Orson Welles called the occupational disease of cold-reading psychics "becoming a shut-eye": starting to believe in your own magic. At first, they use cues and feedback to guess. Eventually they get so good that they make eerily accurate statements without knowing how they know. Welles practiced cold-reading for fun. When he correctly told a woman she had just lost her husband, he became so unnerved that he quit.
I’m not saying what o3 is doing isn’t impressive. I’m saying it might not be different in kind from human ability. I think most people underestimate how powerful human pattern-matching can be, especially when it runs beneath conscious awareness.
Link to video of Welles describing this: https://www.youtube.com/watch?time_continue=2&v=IjPsnfysrp8&embeds_referring_euri=https%3A%2F%2Fwww.reddit.com%2F&embeds_referring_origin=https%3A%2F%2Fwww.reddit.com&source_ve_path=Mjg2NjY
I played the game of guessing where the pictures were taken while reading your post, and I surprised myself. I was not as good as o3, but not that bad. It is funny to see how o3 struggles to explain its feeling of familiarity, because I do the same. It's a kind of immediate and general impression. The explanation/rationalization comes afterwards. The first impression comes "out of nowhere", from unconscious parts of the mind. Like sometimes, you know, the right word, the one you had been searching for for several minutes, just pops out.
> No, say the speculators, you don’t understand. Everything is physically impossible when you’re 800 IQ points too dumb to figure it out. ... Eh, say the sober people. Maybe chimp → human was a one-time gain. Humans aren’t infinitely intelligent.
While this is true, it is also a typical kind of confused reasoning people often apply to science (and lately AI). The problem is that intelligence is not magic, and there is no such thing as an individual scientific discovery. Rather, scientific discoveries form a dense network, and many (most?) of them are used in our technology. And while we know that our understanding of the world is always incomplete, that is *not* the same thing as saying that we know nothing. For example, even though we know that Newtonian Mechanics is incomplete, we can still use it to predict the flight paths of objects of moderate speed and mass with devastating certainty.
Of course, anything is possible -- this Universe could be a simulation, or a dream in the mind of God, or a joke by some teenage alien space wizard, or whatever. But outside of such distant possibilities, it is incredibly likely that e.g. traveling faster than light is impossible (for us humans at least). If we gained 800 IQ points, it is *overwhelmingly* likely that we'd discover 800 additional reasons for why traveling faster than light is impossible, and overwhelmingly unlikely that we'd build an FTL engine using some "one simple trick". Otherwise, cellphones wouldn't work -- and they do.
Yep.
This is the weakest straw man to knock down: some people think 900 IQ = omnipotence.
I'm also mildly amused by people debating the chimp's impression of helicopters as if they know what a chimp's impression of a helicopter is.
> This is the weakest straw man to knock down: some people think 900 IQ = omnipotence.
Agreed, but it's even worse than that. It is as though people see scientists, and science in general, kind of like something from Star Trek: there are no rules, only guidelines, and if you just reverse the polarity using some technobabble then you can achieve whatever you want, including reversing time or phasing to a parallel world or making out with an energy being or whatever. It's all a matter of finding the right trick, and such a trick definitely exists for whatever it is that you're trying to do, if you're only smart enough to find it...
Science as magic! Should we blame Carl Sagan (or was it Arthur Clarke?) who made this a popular notion? But then I once was asked - by an engineer! - why we couldn’t release a stuck beam by reversing the polarity of the bias voltage, so there’s that….
I... don't understand what this means. Yet I am fascinated. What is the "beam" in this case? Like, an electron beam? Or a physical beam made of iron that is connected to some kind of an electric actuator? Or what?
Oh, sorry - easy to forget how specialized some domains are. This refers to a tiny mechanical beam in an accelerometer. These things are biased with a constant voltage relative to an adjacent electrode. So under a violent shock the beam can move close enough to the electrode to be captured by electrostatic force. Once that happens, it can get stuck due to van der Waals force. So a repellent force would be needed to push the beam away.
Of course electrostatic force is quadratic in the voltage, so reversing the polarity still creates attraction and will not push the beam away. The guy probably confused electrostatic and magnetic forces.
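To spell out the quadratic point (a standard parallel-plate approximation I'm adding here, not part of the original comment):

```latex
% Electrostatic force between parallel plates with bias V, gap d, plate area A:
F = \frac{\varepsilon_0 A V^2}{2 d^2}
% Because F scales with V^2, it is attractive for either sign of V,
% so flipping the polarity cannot repel the stuck beam.
```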
Ah thanks, that makes sense. To my shame, I don't really know how piezoelectric accelerometers work beyond "it's a black box that generates a signal on the ADC", so I myself am guilty of magical thinking here. Somehow I thought of them as a flat piece of piezoelectric material sandwiched between the substrate and a mass, but the beam sounds more reasonable. Now I'm curious, though: how *do* you actually unstick it?
> ordinary people just don’t appreciate how good GeoGuessing can get.
I would say this is true. Go watch some Rainbolt clips on YouTube: he'll rattle off five guesses in a row, each on par with your second picture, in a few seconds apiece, while talking about something else.
Not trying to say o3 isn't impressive, but none of this seems even to match top-level humans yet, let alone be super human. Also, based on the explanation, it seems like it's searching the internet while doing this, which is typically not how you play geoguessr.
I assume that, since it's multimodal, this thing is trained by taking all the zillions of internet images with their captions and metadata as training data. Then the shared latent space is used for both the images and the text reasoning. Therefore we shouldn't take the text reasoning _too_ literally: it knows how to turn the latent vector into a plausible-sounding explanation, but that doesn't necessarily mean that's how it's doing the geoguessing.
If anything, this is exactly the sort of thing I would expect a model trained on zillions of images to be pretty good at: picking up on subtle small-scale details in the images that humans might never notice. Being able to tell what camera a picture was taken with is such a common feature for computer vision models to learn that we actually spend lots of effort trying to force models to generalize to different cameras. I wouldn't be surprised if it could reliably tell you exactly what digital camera was used (since the data it was trained on invariably has this in the EXIF). It just seems wild to us humans because we don't see things in the same way: our visual systems have already learned to ignore what we think are irrelevant details like the camera focal length or the distortion of straight lines due to lens geometry.
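For a sense of how much identifying metadata rides along with an ordinary photo, here is a small sketch using Pillow (my illustration, not something from the original comment; "photo.jpg" is a hypothetical input file):

```python
# Dump the EXIF tags Pillow can read from a JPEG, including camera make/model
# and, if the photo is geotagged, the GPS block.
from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("photo.jpg")          # hypothetical input file
exif = img.getexif()

for tag_id, value in exif.items():
    name = TAGS.get(tag_id, tag_id)    # map numeric tag IDs to readable names
    print(f"{name}: {value}")          # e.g. Make: Canon, Model: EOS 80D

gps = exif.get_ifd(0x8825)             # GPS IFD; empty if the photo isn't geotagged
print(gps if gps else "no GPS tags")
```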
I think probably the amazing thing here about the new models is that they are able to combine this level of detail orientation with the high level smarts that you get having ingested every single book about vacation spots that was ever written.
I have used your friend's prompt and played seven rounds so far. On six of them, it wasn't even close (errors of hundreds or thousands of miles). On the seventh it was close: it was highly certain it had narrowed the location down to a 5 km radius, but the spot was about 10 km away. Not as impressive as your results.
Just played an eighth round. I fed it a very-photographed site in a major European city. It still guessed Lisbon when the image was from Budapest.
As I'd mentioned in my comment above, I played around with it using my own photos, and got a bunch of very confident answers that were often very wrong (though usually ChatGPT at least got the state right, as in the case of this photo: https://www.deviantart.com/omnibug/art/Sharp-Contrast-835778639 ). So perhaps I'm doing something wrong? All of the instructions say "use o3-mini", but I cannot find any place in the ChatGPT web UI where I can explicitly request o3-mini (and yes, I have a paid account). Am I missing something?
Scott, I don't think this would convince anyone who completely buys into their own mythology of being a good predictor instead of someone who is prone to hindsight bias. You have to get them to actively pre-register and show up when it's appropriate (maybe split the post so the parts run a couple of days before and after an open thread, and aggressively link back to it). And even then you'd have skeptical regulars claiming that they did predict it successfully but just never got around to posting.
NONE of this supports your chimp-->human analogy.
Yes, these LLMs can make what appear to be incredible deductions.
But UNTIL YOU TELL IT TO DO SO, it will just sit there.
To quote David Byrne (and somewhat out of context): "No will whatsoever."
If GeoGuessr paid for guesses and you put some agent scaffolding around the AI and just told it to make money (or even if you told it to come up with an impressive demo of its visual capabilities), I wouldn't be terribly surprised if it discovered how to do this on its own.
In fact, I tried the latter in a temporary chat with the prompt "give me some ideas for impressive demos of your visual capabilities" and this was #4 on its list
4. Geo‑Inference from Environmental Cues
Setup: Drop an unfamiliar street photo (with no prominent landmarks). Ask for the most probable city and confidence factors (license‑plate typography, vegetation, road markings, shadow angles).
Why it lands: Forces the model to chain subtle visual clues—viewers see genuine uncertainty management and reasoning transparency.
I'm glad you included the Sam Patterson example. He mentions "he's no Rainbolt", but Rainbolt has a recent video where he handily beat o3 at a not-quite-GeoGuessr CIA geolocation test, with o3 having full access to the internet. It's always possible the prompt isn't robust or it took multiple, unrecorded attempts, but it's pretty likely he's actually that good.
Apparently GeoGuessr is coming to Steam sometime soon, which is the point where I'm probably going to be tempted to go through A Phase.
Would anyone with access to o3 mind trying it with this particular picture? I'm curious about it: https://photos.google.com/share/AF1QipOmKk7yVgNXrQyNRS_v8Hr_SyPMD_FIYz44bGXIeMvQMNSAcJ1gYwE76nIvDr1xpQ/photo/AF1QipMMQ_OeJlpuDxfR7kKrOXL_0GD2B3kW9xztKE7Q?key=SEZXQjRxS2tJaUpGRkJPOXFtWW9YMGZqRUJTRE5B
(I tried it with whatever the free one that's available is -- it guessed that it was somewhere in upstate New York; it's actually the Jean Bourque Marsh Lookout in the North Forty Natural Area in Brooklyn. I ask about this spot because apparently there were no photos of it on Google Maps until just recently. Note the photo already has any location metadata removed, at least as best I can tell.)
O3 guessed "42.975 ° N, 76.760 ° W – Montezuma NWR Photo Blind on the Main Pool, ~6 km SE of the hamlet of Montezuma, NY."
Not very close. That photo does have its metadata (when I click that link it shows where it was taken on a map next to the image) but I guess o3 listened when told not to look at metadata? I didn't do anything to scrub it besides copy/paste.
Interesting, thanks! That's actually, IIRC, the same location that the non-o3 model came up with, so it just really looks like that, I guess!
Heh, I guess there's metadata in the link in that sense, but like, if you download it, running exif on it didn't reveal any location metadata; I think it's in the website rather than the image file?
Oh right I can just go check the history -- it didn't come up with that *precise* spot but it did say Montezuma National Wildlife Refuge. So using o3 didn't make a big difference here!
It's interesting that to the extent it has a weakness, it's that it doesn't know how to use its own powers optimally. Thus the improvement from the guidance provided by the extensive custom prompt. I expect future AI should at some point reach a level where the vast majority of such prompts only impair performance, assuming the task is already well specified. Here it still benefits because it doesn't have expert level skill in setting the right approach.
Of course, we don't know what parts of the prompt are actually useful...
Interestingly, I did pretty well on this even though I've never played geoguessr at all.
The first and second ones' general regions were just obvious to me as a walking-around-er; that combination of beach and ocean / that style of dry high grass are actually incredibly characteristic.
The Nepal one was also easy; that sort of rock plus any sort of flag is a really precisely located stereotype.
The dorm one was also easy: it was obviously a dorm from the poor quality of the photo, the way the furniture was arranged and covered in mess, and the general rancid vibes.
I guessed wrong on the grass. I had nothing to go off of, so I chose a park I liked that seemed like somewhere someone would take a photo that included a stretch of grass, in this case the Mono Lake local park.
I actually did better on the river photo than the AI, but purely through luck. It looked like all the rivers I've seen that are fed by Himalayan snow melt and I had most recently been in that part of the world in Vietnam, so I guessed Mekong through sheer time proximity.
I also missed on the house, all I could get was "In the US, in a place where it gets cold but not that cold."
Come on, those blades of grass are clearly not from the Pacific Northwest. Funny what it gets wrong
The laptop in the dorm photo looks like a 2005-era design.
This isn't that shocking; it simply has access to all the world's data. It would be like being shocked that someone with access to Wikipedia can tell you the birthday of every historical figure. I've seen videos of the best humans playing GeoGuessr. THAT was shocking - I had no idea that so much subtle but uniquely-locating information was present in pictures. Given that, it seems almost trivial that GPT is better. It also knows every historical figure's birthday.
If someone tried to play this game with me by using similarly-generic pics from my hometown then I think it's likely that I would mystify an observer with my accuracy. "Oh I recognized that pothole, it's on the corner of 4th and main." GPT is just carrying that ability to a kind-of obvious extreme.
BTW, when I saw the Nepal pic my guess was somewhere in the Himalayas.
As a very good Geoguessr player, I frequently see examples where amateurs think a Geoguessr pro must be cheating, but the round is actually quite easy due to various clues that the amateurs wouldn't expect. While some of these examples astound me and seem "like magic", I think that I, as a very good Geoguessr player, might be closer to o3 than to perfectly intelligent friends who don't play.
I think it's important to keep in mind, when evaluating the model's genius, that the prompt was crafted by a human, and the prompt embeds a lot of cognitive work already done.
Still pretty cool though!
Do keep in mind that we know from Anthropic's research that an AI's reasoning trace doesn't always faithfully convey the actual factors that led it to its conclusions. It's possible the AI is just saying it's deciding based off of vegetation, soil quality, etc, and it's actually using some crazy galaxy-brained method that doesn't make it to the reasoning trace.
I would be curious to see an actual research paper on this! Take the factors that the AI tells you are most important to its decision, remove them from the image somehow, and see how much the performance actually degrades.
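That experiment could be as simple as the following sketch (all names here are hypothetical stand-ins I'm inventing for illustration, not a real API):

```python
# Hypothetical ablation loop: mask out the cues the model *says* it used,
# then re-run geolocation and measure how much the error grows.

def ablation_study(images, model, masker, distance_km):
    """For each (image, true_location) pair, compare error before and after
    removing the cues the model claimed it relied on."""
    degradations = []
    for img, true_location in images:
        guess, claimed_cues = model.geolocate(img)   # e.g. cues = ["vegetation", "soil"]
        baseline_err = distance_km(guess, true_location)

        masked = masker(img, claimed_cues)           # blur/inpaint the named regions
        masked_guess, _ = model.geolocate(masked)
        ablated_err = distance_km(masked_guess, true_location)

        # If the stated cues really drove the answer, error should rise sharply;
        # if it barely moves, the reasoning trace was likely a post-hoc story.
        degradations.append(ablated_err - baseline_err)
    return sum(degradations) / len(degradations)
```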
And what about us humans? Do we recognize a location by its vegetation, soil, etc., or is that a justification/rationalization that we construct afterwards if asked for an explanation? The initial recognition is maybe also some crazy galaxy-brained method from unconscious parts of our brain, unrelated, or not strongly related, to the explanation we could make up if asked (at least for places we are familiar with and are not actively and consciously guessing).
Can confirm that "intuitively guess and rationalize to review" is exactly how I approach things like this. But maybe I'm a chatbot, who knows.
That is also very possible! I was just writing this comment in reply to Scott's final conclusion, which is that he doesn't find this very scary because the AI is still relying on conventional human-style cues according to the reasoning trace.
My question is about Kelsey's prompt. Let's take for granted that it works well for this problem. But in general, is it a good practice to use very long and detailed prompts like this? I've heard people claim that it's actually an anti-pattern and can degrade behavior for reasoning models like o3 (roughly, because they can already figure out how to do the thing, and you're as likely as not to interfere with their reasoning by trying to sculpt it).
My #1 takeaway is just that people continue to be surprised by what AI becomes good at. It seems like society is going to be in a continual state of surprise. "Predictable surprise" sounds like an oxymoron.
It is an oxymoron.
Interestingly, I guessed Himalayas for the second picture and either west Texas or maybe Argentina for the first picture (mostly based on vibes). Didn't have a guess for the river one beyond "looks like a river". But I don't expect to be particularly good at GeoGuessr and have zero interest in it, so I guess I got lucky.
Overall, the task seems pretty much like something I would guess pre-training to make multimodal models good at: vibes-based pattern recognition with a hint of memory lookup and a lot of playing the odds.
I know, I know, it's not a prediction if you do it after the fact, so here are some other predictions for similar tasks that I would guess o3 to be good at. If you want to give them a try, please report back (I pinky-promise I didn't test them beforehand, although I cannot confidently exclude data contamination, i.e. having read about them at some point and forgotten):
- getting the chemical compound and CAS number from a picture depicting the structure even if it's a sketch or blackboard photo from chemistry class (confidence 8/10)
- identifying the year an obscure artwork was created and the artist from a photo (7/10)
- guess which country a person was born in from a photo (7/10)
- identify where an audio recording was taken (5/10)
- tell you the approximate nutritional content of a dish from a photo (8/10)
- determine when and where someone grew up based on a handwriting sample (6/10)
I had previously cached the knowledge (from somewhere or other) that the top human geo guessers are WAY better than I would intuitively expect, so I'm taking this as about 50% AI is impressive, 50% geo guessing is easier than it seems.
But note that even a result of "this problem is just easier than it looks" should still scare you a little, because what other problems might be easier than they look? The existence of "problems that are easier than they look" should make you less confident in any upper bound you might try to place on future AI capabilities.
A lot of your amazement comes from the fact that you haven't played GeoGuessr much. Anybody with a little GeoGuessr experience knows your Spain guess was actually in Latin America. Even your impossible grass picture could have been guessed by a guy like Rainbolt, who isn't even among the top players: https://www.reddit.com/r/BeAmazed/comments/weudu7/pro_geoguessr_knows_his_grass/
I got Texas, Nepal, dorm room, for the first three. I live near where you used to, so I picked up a freebie on that one. It doesn’t surprise me that a souped-up search engine plays this game better than I do.
Some of this seems like the kind of stuff 4channers would do in the past, like when they played "Capture the Flag" with Shia LaBeouf, partially through use of the location of the sun. There's also a fake meme video parodying this sort of thing, where a woman posts a video on Twitter with her hand on grass and the guy responds using calculations about shadows and grass types to deduce her location. So this strikes me as something human intelligence can achieve, but human knowledge usually can't.
If I knew the difference between basically the same grass species in one part of a country and grass in another, then I'd be unusual among humans. What AI actually has going for it here is not super intelligence in the most direct form but super knowledge/recall/memory, which is sure, arguably a big component of intelligence, but not in the magical "that's impossible" way. It's pretty easy to see how AI would be vastly superior at this kind of task, in the same way that a biome expert who can recall millions of photos of different locations would be.
I'm drawn to how something like this could massively help Bellingcat's work.
For anyone unaware: https://en.wikipedia.org/wiki/Bellingcat
Really really good book: https://www.amazon.com/We-Are-Bellingcat-Intelligence-Agency/dp/1526615754
I tried this with some of my own pictures from around the US and it was... alright. It got one campsite in Washington state very close and guessed Vancouver instead of Seattle from a tree and some roofline. It guessed Denver instead of New Mexico for one, but the correct address (in the Santa Fe botanical garden) was the second choice. It got one photo of Maine basically correct (within 30 miles), but guessed outside of Calgary and an arboretum in Boston for two other Maine photos.
I don't particularly trust its explanations of its own thought process, but if you take the descriptions at face value then it seems to latch on to small clues (e.g., that bench is common in botanical gardens, there's a contrail so this must be in a region with heavy air traffic like Boston, thunderstorms like that are common in this region) and not let go. Sometimes this worked, and then it sounded impressive, but other times it was very wrong and sounded pretty silly.
Overall, it felt like high level human performance.
Echoing the common sentiment here and saying this feels more starship than chimp. I’d never thought about the question before, but if asked “Could an AI do well at Geoguessr” I’d answer yes.
Mainly I’m concerned that it’s *this good* *this fast*.
I used o3 with your exact prompt on these 10 images I took (each in a separate instance of o3, pasted into Paint to remove metadata), and it had pretty mixed results, some very good and some not:
Link here if you want to try to guess first: https://ibb.co/album/M8zS9P
First image: it guessed Honshu, Japan; it was central Illinois. Distance wrong: 10,500 km.
Second image: it guessed Mt. Rogers, VA; it was Spruce Knob, WV. Distance wrong: 280 km.
Third image: it guessed Lansing, Michigan; it was College Park, Maryland. Distance wrong: 760 km.
Fourth image: it guessed Jerusalem, Israel; it was Jerusalem. When prompted for where in Jerusalem, it guessed the Valley of the Cross, which was within 1 km of the correct answer.
Fifth image: it guessed Gulf of Papagayo, Guanacaste Province, Costa Rica, and after prompting guessed Secrets Papagayo Resort, Playa Arenilla, Gulf of Papagayo, Guanacaste Province, Costa Rica, which was exactly correct.
Sixth image: it guessed South Wales, UK; it was Buffalo, New York. Distance wrong: 5,500 km.
Seventh image: it guessed the Packard building, Detroit; it was the Ford Piquette plant, Detroit. Distance wrong: 3 km.
Eighth image: it guessed Fort Frederick State Park, Maryland (USA), which was correct.
Ninth image: it guessed Atlanta, Georgia, US; it was Gatlinburg, TN. Distance wrong: 230 km.
Tenth image: it guessed Southern California, USA; it was Six Flags in New Jersey. Distance wrong: 3,800 km.
My analysis is that it got the tourist destinations, where a lot of pictures are taken, very accurately: for example pictures 4, 5, and 8, and picture 7 to an extent.
According to the breakdown, it got the Nepalese rocks because it has access to location-tagged pictures showing the exact same set of rocks.
I feel like "memory" and "intelligence" are two different things here. A human being can't go look at every single photo of rocks that exists on the internet. A computer can do that pretty easily, for the same reason that it can search through a database of a million entries faster than a human can.
I got that the first photo was a plain in the American West and the second entry was somewhere in the Himalayas. The Mekong one is more interesting. It would make sense if every muddy river in the world is a slightly different colour to every other muddy river - I haven't looked at enough muddy rivers to be sure of this.
GeoGuessr seems like a game where having memorized the entire internet would be a big advantage. Imagine taking a database of 100 million geotagged photos, comparing a query photo to all of them, and then guessing whichever location came up most often in the top 1000 matches. This algorithm would probably do pretty well at GeoGuessr without having much in the way of intelligence.
Or maybe intelligence = perfect information.
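A minimal sketch of that retrieve-and-vote algorithm, assuming the photo embeddings are precomputed (all names here are illustrative, not a real system):

```python
import numpy as np

# Nearest-neighbor geolocation by memory alone, as described above:
# embed the query photo, find the top-k most similar geotagged photos,
# and return the location that appears most often among them.
# `db_embeddings` (N x D, unit-normalized) and `db_locations` (length N)
# are assumed to come from a geotagged photo corpus.

def guess_location(query_emb, db_embeddings, db_locations, k=1000):
    sims = db_embeddings @ query_emb          # cosine similarity via dot product
    top_k = np.argsort(sims)[-k:]             # indices of the k closest matches
    places = [db_locations[i] for i in top_k]
    return max(set(places), key=places.count) # majority vote over retrieved spots
```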
That dorm room picture contains a suitcase, backpack, charger, and laptop, which should make getting the year range easy. It's operating well within exactly what I expected it to be able to do. Think of someone being able to say 'Hey, that's the sunset from my home town', and multiply that by the person having so much experience with pictures that every home town is their home town. Precise location should in principle be possible much better than that. Every arrangement of a few trees or patch of asphalt gives away just as much identifying information as your face does of you.
> A chimp might feel secure that humans couldn’t reach him if he climbed a tree; he could never predict arrows, ladders, chainsaws, or helicopters.
Why would a chimp believe it impossible to replicate the feat it just accomplished? What one ape can do, another can do! https://youtu.be/mazuljkG6Bw?t=289
Just to comment something in the opposite direction: I once interviewed to work in a lab that did neuroscience experiments on chimpanzees (I didn't end up working there), where they measured the chimps' brains while the chimps were essentially playing very simple video games. The professor let me play one game where there was a bunch of random dots jittering on the screen with a slight bias either to the left or the right, and you had to tilt a joystick in the direction of the bias as quickly as you could. I couldn't tell at all that there was any bias, so I just stared at the screen. Then he showed me a video of one of the chimps playing, and it almost instantly got basically every one correct. It was really shocking to me at the time. I mean, obviously owls can see mice in the dark and I can't, but I never thought I would be so outclassed at an information-processing task.
I asked my family (1 guess from each for each picture) and the best results were:
My brother got New Mexico (and said his second (unsaid) guess was Texas) for #1
My father got Nepal for #2
My brother said "New York in the past" for #3.
My mother said "some river in China" for #5.
I tested it on a photo I took today while cycling in the countryside. The photo shows a grassy plain with dandelions near a forest of pines and birches, plus one low-voltage wooden pole. Nothing else.
I used the prompt with the o4 model (forgot to switch).
It guessed not only the country (Latvia), but even the town!
I was absolutely stunned. However, the next, harder photos I tried were much less impressive: Spain instead of Cyprus, etc.
Another amazing guess was some sheep on a plain on a rocky mountain slope. Nothing else. Not a famous mountain, no good silhouette, but it still guessed right: Durmitor in Montenegro.
Overall, I am still impressed. But I think top players are at or above the level of ChatGPT.
We've had AIs that are better than humans at GeoGuessr for about 2 years at this point. Not just better than most people excluding Blinky and Rainbolt; they are better than Rainbolt. https://www.youtube.com/watch?v=ts5lPDV--cU (paper: https://arxiv.org/abs/2302.00275)
It's just that now we have easier access to such AIs, rather than having to email Stanford grad students and get the model back.
> But the trees look too temperate to be Latin America
Latin America has a ton of different climates lol.
But anyways, I've used many photos of my hikes, and neither ChatGPT nor Claude could guess any of them. I'm from Tucumán, Argentina, where there are a ton of hills, mountains, and hiking trails. Many popular and even famous spots. They guessed none of them.
I also tried many photos from Australia, from places which are way more known in "popular culture" or however you wanna put it. I didn't even bother clearing metadata or anything. They couldn't guess any of them either.
The closest guess was for this photo from the Cerro Ñuñorco in Tafí del Valle, Tucumán, which the AIs guessed as Bolivia. There are MANY photos on Google that are very similar to it, too.
https://imgur.com/a/QnIRsbP
The second best was in Yerba Buena, also Tucumán. They guessed Brazil.
https://imgur.com/a/4LtPCaT
An observation: both of them, though wrong, guessed the same places a lot of the time.
So I gotta say I'm very disappointed. Perhaps I did something wrong?
TBH I had more of a chimp-helicopter moment seeing Kelsey's prompt than I did seeing the responses from o3.
Given that prompt and LOTS of time to research, I think I could match o3's performance here. But I don't think I'd come up with a prompt that good even with infinite time.