284 Comments
Comment deleted
Expand full comment
Feb 10, 2023·edited Feb 10, 2023

> In some other case you have 0 idea at all. I have 0 idea how far some specific exoplanet is from Earth. Thus I'm likely to make wild guesses that cover my bases. Uh, 5 light years? Uh, 1500 light years?

~~It isn't possible to detect exoplanets 1500 light years away, so you can rule that out. I have no idea what the practical limit is, but my own random guess is that all known exoplanets are <60 light years away. ~~

Edit: I just looked it up and I am *completely* wrong about this. Per Wikipedia, the most distant candidate exoplanet is 27.7k lya. The most distant directly visible is 622lya. Shows how much I know!

As for 5 light years, there's only like one or two stars that close. So it depends on whether you think they might be talking about the exoplanets of those stars specifically (which they might!)

Expand full comment

I think your second poll question's caveat that you were off "by a non-trivial amount" may play in here. If I was really confident in the distance from Paris to Moscow, or the weight of a cow, my second guess would be pretty close to the first. But the way the question was phrased, most people would feel compelled to change it up for their second one, even if they were very confident the first time.

Expand full comment
author
Feb 6, 2023·edited Feb 6, 2023Author

I agree, but I didn't know how else to prevent people from putting their best guess down twice. I can't ask them to wait until they forget, because most people only take the survey once, and it only takes a few minutes. Does anyone else have any ideas?

Expand full comment

Ask people about their confidence level after the first guess, and exclude people who are close to 100% confidence?

Expand full comment

"Put the questions further apart on the survey" seems like a useful change to make at the margin, though it might not do much on its own. And since some people won't remember what they guessed earlier, you'd want to replace the "assume your earlier guess was off by a non-trivial amount" language with something more like "if you remember your previous guess, try to disregard it". Which as Dave says is closer to what you were actually looking for.

Expand full comment

It might also be worth doing some bias checks, too. Like have a large number be part of an estimate or question directly preceding one of the guesses to see how humans' previous number bias impacts the accuracy of the crowds' wisdom.

Expand full comment

I thought that's what he was doing, with the "number of unread emails" question immediately before the second guess. That will be high for some people and low for others, and I thought he was using that to investigate the anchoring effect.

Expand full comment

But in every example of a use of the anchoring effect I've ever seen there has to be an actual number used to act as the anchor. And, maybe it was, I was one of the folks who didn't do the survey. Not enough bandwidth and my reading of other materials has been haphazard.

Expand full comment

A very good answer. You can also switch up the wording a little. I think this is commonplace on psych tests, where they essentially ask the same question twice well separated, and maybe phrased differently, to indirectly probe how confident the responder is in his answer.

Expand full comment
founding

When I answered that question I was very confused. I figured you were trying to measure something like this, but I still wanted to put the same number down--I'm just as likely to be too high vs too low, so it's still my best guess.

The question makes a lot more sense if there's space in between. I'd suggest doing the same thing next year, but making it the first and last question, and hoping people forget in between.

Expand full comment
founding

>I agree, but I didn't know how else to prevent people from putting their best guess down twice.

Maybe say "use another method to estimate the value" or "what value do you think other people would guess"?

Expand full comment

Potentially just ask users to re-estimate without giving any new data (simulated or real). Eg. First question is "Distance from X to Y to nearest Z km", where Z is some reasonable bucket of guess (maybe 100 or 50). Then second question is "Assuming your first answer is incorrect, guess again". That is hopefully the bare minimum information necessary to force a recomputation without pushing the data too far out.

The disadvantage of this is that (I'd assume) it will be worse at correcting massively incorrect answers, though it may be better at refining reasonably good answers.

Expand full comment

I already run a sort of inner-crowd algorithm when making these sorts of estimates (e.g. try a guess on myself, decide that sounds too high/low, adjust, try it on my self again, etc.) and I assumed everyone else did as well. If I'm atypical in that regard, then maybe you could ask a pair of questions with an A/B scheme and ask them to consciously use the algorithm on one but not the other.

As it was, I put the same number down twice, thinking, "If I felt my answer was wrong in one direction, I would have already adjusted it in that direction. Maybe it's a logic puzzle to see if I know that any adjustment I make should be as likely to be more wrong than more correct."

Expand full comment

I ran it as a kind of ranked-choice voting. I put my best guess down as the #1 candidate. Then, reading the instructions, I thought: if my best guess is wrong, in what direction might I think it most likely to be wrong? and put the second-best guess down as my #2 candidate.

Expand full comment

That makes sense for discrete choices, but is not for continuous ones.

Expand full comment

I don't see why that is true. Besides which, my guesses for a distance like that actually will be discrete, simply because it would be silly to guess 0.00013457 km less than the first guess. There's some modest quantum below which no reasonable person would go, say 10-100km, depending on your level of confidence.

Expand full comment

Ask one in kilometers and the other in miles? They can't reuse the same number so they're more likely to reassess from scratch.

Expand full comment

I am skeptical that that would be helpful. I conceptualize long distances exclusively in miles, and I did my thinking in terms of miles and then converted to km at the end. I suppose I assume that most people would work the same way (solve the problem in the familiar unit and convert to the unfamiliar unit), and then any difference between the answers would be unit conversion error. But perhaps that's typical mind fallacy on my part.

Expand full comment

I also think in miles, but I originally thought this question was measuring something like metric system awareness, so I deliberately avoided doing the conversion and made a best effort guess in km (which turned out to be horribly off compared to what I would've said in miles).

Expand full comment

A related experiment might be: 1) give your best guess

2) If your best guess is off, do you think the true value is higher or lower than your guess?

I suspect honest participants would be right significantly over 50% on the second question. (One could always guess way off deliberately to make question 2 too easy.) Even when we give our true best guess, I suspect we have some subconscious idea as to how we might be wrong and by what magnitude. But we lack the intuition to give a "weighted" best guess which combines our favored value plus our opinion on how we might be wrong.

Expand full comment

Ask them once. Ask them again much later. Then ask them to tell you, without looking what they think their first guess was. Now at least you have data about who remembered their first guess well and who didnt. With the second group being the better set.

Aside, speculative, a tool to help them forget might be to ask to an absurd number of decimal places

Expand full comment

Ask a numeric question that isn't easily googleable and offer a small prize for the closest guess. Putting some distance between your two guesses would increase your odds of winning, but people who are very confident that their first guess was close will still guess a nearby number.

Expand full comment

In real life, I do an inner crowd estimate by forcing myself to guess an upper bound, and a lower bound, and then average those. I feel this is more accurate than a simple guess; but I haven’t studied it. Could you ask for a high/low?

Expand full comment

At the risk of overcomplicating things, instead of asking for a number, ask for an analogy: "What major city is approximately the same distance from LA as Moscow is from Paris?" Then later, either imply that the first guess was not the closest, or just ask for a reconsideration guess in the same fashion as before. Technically, it's a more difficult question to answer, but that doesn't matter because we're not really testing knowledge here, and there's enough fuzziness that I think most people would appreciate the chance to hazard a backup guess.

Major downside is that this requires way more data entry because someone has to calculate the distances to every city inputted and plug those values in to get the data you're really after.

Expand full comment

The trouble with asking a different question is...it's a different question. Suppose the first question was the distance from Los Angeles to New York, and the second from Los Angeles to Honolulu. They are pretty close (about 2450 miles against 2550) but some people may think the distance over the ocean is significantly bigger than the distance over land. Or vice versa. Or maybe that the mountains make a difference, or the geodesics, or the lizardmen prefer water over land, so make the distance shorter.

I think, as has been mentioned, the best way is to ask the same question twice, and ask not to refer to your own previous answer. Then throw out answers which have the same answer for both questions, as probably being people of best memories. This, of course, has its own problems, including:

1. Eliminating people with the best memories, some of whom might actually remember the answer the first time. This would tend to make the average worse.

2. The second question might actually be asking what you think you remember, rather than having you make a second guess.

3. You may be throwing out those that estimate most consistently, if they don't remember what they guessed but happen to guess the same answer again. This is somewhat more likely, given that people are more likely to guess in round numbers, like 3000km rather than 2991km or 3009km.

Expand full comment

Ask the question on two successive years' surveys (linking people using the identifier string).

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

Maybe this is about phrasing. "Just throw out your guess and guess again" is different from "your first guess was wrong". Some people might not get it, but at least the other answers wont be biased by a new piece of information in the question.

Expand full comment

The original study this is based upon is the beautifully titled "Crowd Within" by Vul and Pashler (https://journals.sagepub.com/doi/pdf/10.1111/j.1467-9280.2008.02136.x). The study found the same thing you found/hypothesized: (1) WoC is a lot less powerful within people than between different people; (2) it works better when people have time to forget their initial answer than when they are immediately prompted to offer a second guess.

So, no, this is not the silver bullet of prediction. It beats your first hunch, but not by a wide margin. The improvement is probably even less when (as some of the comments here suggest) the respondents are savvy forecasters who already put a lot of thinking into their first estimate.

Expand full comment

"Someone who knows your geography skills saw your first answer and said: 'Come on, I expected you to make a *better* guess.' Try again."

Like, something ambiguous enough that a person with low confidence could interpret it as "add or remove 1000 miles", and someone with high confidence could interpret it as "almost correct, but add or remove 1 mile".

Expand full comment

This is a good one.

Expand full comment

Agreed. I wonder what happens when one only considers answers by people for whom the "you are wrong by a non-trivial amount on your first guess" assumption holds.

After a small amount of thinking, I'm not sure what that would mean in practice, but here is what I came up with.

Considering only people who answered both questions in the open survey results (6379 people), and letting A1 be their first answer and A2 be their second answer, then:

|geometric_mean(geometric_mean(A1, A2) over everyone) - 2487| = 351

Meaning, the ultimate crowd wisdom is off by 351 km. This differs from Scott's 243 km. It's quite a difference, I'm not sure whether I misunderstood what he did to obtain that number, either of us made a mistake, or the people who didn't want their answers shared were particularly knowledgeable about European distances. FWIW I'm not at all confident in what I'm doing.

I then calculated the same thing but only for those people who were off by at least 500 km in their first guess (4911 people), and that group was off by 520 km, while the remaining 1468 people were only off by 149 km. So having a good first guess leads to a much better overall guess, which I suppose is not surprising.

But would this latter group of people do better to stick with their first guess or is the second guess still helpful? The second guess is still helpful! The geometric mean of only the first guesses of the people who were off by less than 500 km in their first guess is off from the real value by 267 km, while as I mentioned above taking both guesses into account leads to a 149 km error.

But I picked 500 km rather arbitrarily. Clearly people who were outright right on their first guess should do better to stick with their first guess, so we should be able to determine what "non-trivial amount" means empirically by finding the cut-off error below which you would have been better off sticking to your first guess. It turns out that this number is... 487 km. Weird how close to my arbitrary choice of 500 km that is.

Expand full comment

Yeah that one got me too. I was decently close on the first one (I imagined the distance needed for all the war letters I’ve heard to be as dramatic as they were on the podcasts I’ve listened to) and surprisingly got pretty close. Then I way overestimated the second go around.

Expand full comment

Also, different people might interpret "a non-trivial amount". I'd consider even 30% "trivial" in the context of talking to someone who hasn't personally travelled from Paris to Moscow or back before, but I can imagine someone considering anything over 5% "non-trivial".

Expand full comment
founding

This is exactly what happened to me. My first guess was pretty close, at least in the right ballpark. My second guess, I tried to figure out what might have made my first guess off by a non-trivial amount in this hypothetical, figured "well OK what if I was way off in my from-memory estimate of the circumference of the earth, or how far around it paris->moscow is?" and gave a wildly high guess that was WAY WAY worse.

Expand full comment

I agree. Not knowing which direction and knowing it was significantly off made me play a game. Suppose my first guess was 5000km. If it was way off, my inclination would be that it was too low, and making a second guess of 6000km is not "significant" enough. So, basing my guess on both the distance AND knowing the first guess was wrong would seem to make 8000km a reasonable second guess.

Expand full comment
Feb 8, 2023·edited Feb 8, 2023

That would have worked for me, as I initially guessed 2000km, and if I were to guess again after that warning, I should certainly have picked either 2500 or 3000.

Thinking on that, I must have known that my first guess was much more likely to be too low than too high. Maybe the magic works by forcing people to offset this kind of consideration on the second guess?

Expand full comment

This happened to me. I knew that the distance from Cologne to Kaliningrad is famously 1000 km, so my first guess was 3000 km. So, supposing that this was very far off, I said "oh blimey" and chose 4000 km for the second guess.

Expand full comment

This feels a bit like a human "let's think step by step" hack. Also, seems like some part of this benefit is obtained from common advice to "sleep on a important decision" and not make super important decisions impulsively.

Expand full comment
author

That's an interesting analogy, but I would have thought the advantage of sleeping on important decisions is considering them in two different emotional states; I wouldn't have expected emotional state to impact estimates of distance to Moscow.

Expand full comment

I've read about how negative emotional states lead to lower numbers in estimates and guesses and positive emotional states lead to higher. It might take me a quick minute to find the reference.

Expand full comment

Wouldn't it depend on the valence of the thing you're estimating? I'd be surprised if that held true for estimating, say, child mortality rates or number of people shot by police.

Expand full comment

Valence of the thing your estimate might have an impact, but current emotional state prior to reading the question or topic also seems to have an impact, too.

Expand full comment

I'm sure it does, but I'm hypothesising that the direction of that impact will be *opposite* depending on the valence of the thing.

I would be extremely surprised if happy people predict higher numbers than sad people for everything, including bad things.

I'm not trying to minimise the effect of emotional state; I'm just trying to give it a directional multiplier.

Expand full comment

Your hypothesis could be correct. I don't know. I can't remember from the abstracts and books I've read. But, it might not, our brains do surprising things all the time.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

People are likely to get emotionally attached to their first thoughts or guess (by some tiny amount) before they've uttered it and while they're still thinking about it. People like being correct more than they like being wrong, and they like assuming questions are approachable at first sight instead of admitting their confusion. This is even true if they're estimating distance to Moscow to answer a survey question and they previously did not care either way what the distance to Moscow is.

Asking them to not over-anchor on their first thoughts and think again with fresh eyes allows them to work against this bias.

Expand full comment

Test scores would say that emotional state can matter even on empirical information. Some people don't get enough sleep or are dealing with some other type of issue and do more poorly than if they are rested and thinking more clearly. Also, I think we tend to subconsciously, and sometimes consciously, mull over the question and add information beyond what we thought of for our first guess.

Expand full comment

I think you need a hypothesis about the underlying reason for wisdom-of-the-crowds improvement before you can compare the underlying reasons and conclude it's different in the "sleep on it" case.

Expand full comment

It's also just like the consultancy trick of saying, "assume we are looking back with regret at how we managed this. What did we do wrong." It's very much a "assume your thinking was incomplete, apply more dakka."

Expand full comment

I’m mad because I was actually super happy with how close my first guess was - but I didn’t read the question right and guessed in miles, not km. My second guess was in the wrong direction, anyways, so i mostly just got lucky.

Expand full comment
founding

I'm out of the loop: OP == "overpowered"?

Expand full comment

Correct.

Expand full comment

Correct, and I suspect it comes from gaming culture. In League of Legends, it was common for a new champion to arrive that was "improperly" weighted to be too powerful. It would be said that champion is OP. Then, when the champion's stats were "fixed" to make them more aligned competitively with the other champions, it was said to be "nerfed." Also, "nerfed" was common to apply superlatively to say some character/champion/weapon had been made undesirably weak.

Expand full comment

"Nerf", of course, referring to the brand of toy guns that shoot foam darts, because what better way to describe a power reduction than to analogise to a real weapon being replaced by a NERF version

Expand full comment

In theory if there is no systematic bias the error vs crowd size graph should be an inverse square root, not the inverse logarithm you fit to the curve. This follows from the central limit theorem if we have a couple assumptions about individual errors (ie finite moments).

This actually makes the wisdom of crowds much more impressive as the inverse square root tends to zero much more quickly.

Expand full comment

I think the poll's instruction to assume that your first answer was wrong by some 'non trivial amount' is important. It's effectively simulating the addition of new data and telling you to update accordingly. Whether the update is positive will depend on the quality of the new data, which in turn depends on the quality of the first answer!

ie. If my first answer was actually pretty close to reality (mine was; I forget the numbers and the question now but I remember checking after I finished the survey and seeing that I was within 100km of reality), a 'non trivial' update is pretty likely to make your second answer worse, not better. That's quite different to simply 'chuck your guess out and try again'. It also suggests that ACX poll-takers may be relatively good at geography (compared to... pollees whose first guess were more than what they think of as a trivial amount wrong? I don't know what the baseline is here).

Without reading through all the links above it's not clear whether the internal crowds referenced were subject to the same 'non trivial error' second data point. In the casino presumably there was some feedback because they didn't win, but I don't know how much feedback. I'm about to go to bed so I will leave that question to the wisdom of the ACX crowd and check back in the morning.

Expand full comment

Same here. My initial guess was extremely good (by coincidence - I produced 2500 by a method that would derive that number for basically any real distance between two and four thousand if it was sufficiently close to west-east line), so, when I had to correct, I made it 3000.

Expand full comment

...now I'm really curious what your method was.

Expand full comment
Feb 8, 2023·edited Feb 8, 2023

Sure! Here it is (with real-world comments where needed in parentheses).

Paris-Moscow is more-or-less close to being west-east, so they're on a parallel. Equator is 40000 km. The relevant parallel is probably somewhere between 45° and 60°, closer to the former (the real latitude is 48°50′ for Paris, 55°45′ for Moscow) because St. Petersburg is around 60°. Thus the parallel's length is between about (two-thirds of 40000 km) and 30000 km (the real length of 45th parallel is somewhere about 28440; real parallels of Paris and Moscow are shorter but their not actually being identical would lengthen the hypotenuse); I use the upper bound because they're probably not _exactly_ on west-east line (that part is correct!).

The whole parallel comprises 24 timezones so one needs to calculate how many timezones are there. There are two hours of difference but they're, to put it mildly, not the center of their respective timezones, with Paris being rather close to London by longitude (real longitude of Paris is 2°20′, of Moscow is 37°37′); on the other hand, Moscow's time is off, it's so-called "decree time", with many countries at similar longitudes having UTC+2 rather than +3. So, 2 timezones' worth or a little more (real longitudes would mean about 2.5 timezones' worth as 37-2=35°, a bit less than one-tenth of 360°); 24/2=12; 30000/12=2500.

In short, several more-or-less-mistakes - or at least very crude approximations - cancelled each other to an extent that frightened me when I later looked up the real distance.

Expand full comment

Always funny when that happens. Estimating based on the timezone difference is a great idea though!

Expand full comment

Yeah, I ran into the same issue. I was pretty confident my first guess was within a good ballpark of correct. So when it asked me to assume my answer was off by "a non trivial amount" I definitely overcorrected, and my second answer was definitely worse. I knew I was probably overcorrecting, and that my second answer would probably be objectively wronger. But I didn't know how else to treat the question, other than to simulate what I would guess, as my second guess, in an alternate reality where my first guess had proved to be highly incorrect.

Expand full comment
founding

You gave a handful of examples where we could hypothetically benefit from the wisdom of crowds. But in each case, we *already* leverage the wisdom of crowds, albeit in an informal way.

E.g. my decision of academia vs industry is based not just on a vague personal feeling, but also aggregating the opinions of my friends and mentors, weighted by how much I trust them. True, the result is still a vague feeling, but somewhere under the hood that feeling is being driven by a weighted average of sorts.

I'm not sure there'd be much utility in formalizing and quantifying that--we'd probably only screw it up in the process (as you point out).

Expand full comment

Yep. This reads a lot like rediscovering the wheel. But I think there *is* value in quantifying and formalizing it. For example, the fact that the "inner" crowd is a lot less powerful than the "outer" crowd is very interesting, and points to some conclusions about psychology (and for that matter social psychology).

Expand full comment

I use wisdom of the crowds when I cut wood and I don't have my square; If I need a perpendicular line across the width of a piece, I'll just measure a constant from the nearest edge and draw a dozen or so markings along at that constant. They won't all line up (because I can't measure at a perfect right angle) but I just draw a line through the middle of them and more often than not it's square enough, because I'm off evenly either side of 90°.

Expand full comment

Measure twice, cut once!

Expand full comment

Cool! (Though I may be misinterpeting. I think this because, with my interpretation, it doesn't matter whether your angular errors are off evenly either side of 90; it's sufficient, for example, that they have the same distribution at any point along the nearest edge, even if that distribution is left or right-biased).

Expand full comment

With your last point, an important part of this is whether "wisdom of crowds" is a spooky phenomenon that comes from averaging numeric responses, or whether it's an outcome of individuals mostly having uncorrelated erroneous ideas and correlated correct ideas (so that the mistakes get washed out in the averages).

If it's the second, you'd expect that all sorts of informal and non-quantitative ways of aggregating beliefs should also work. If you want to know whether to go to academia or industry, you ask 10 friends for advice and notice if lots of them are all saying the same thing (both in terms of overall recommendation or in terms of making the same points). If you want to build a forecasting model, you can hire 10 smart analysts to work on it together.

Of course, the details matter--if you have people make a decision together, maybe you end up with groupthink because one person dominates the discussion, pulls everyone else to their point of view, and then becomes overconfident about their ideas because they're being echoed by a bunch of other people. If the "consensus information" and "individual errors" in people's thinking are fairly legible, on the other hand, you might do a lot better with discussion and consensus than with averages because people can actually identify and discard their erroneous assumptions by talking to other people.

Expand full comment

Isn't it uncontroversially the latter? People's opinions on the distance from Paris to Moscow are caused by lots of things, but mostly these can be divided into 1) the actual distance from Paris to Moscow and 2) random noise. The random noise gets cancelled out by the law of large numbers, which is all the wisdom of crowds really is. It's no spookier than the fact you can infer the probability of a dice landing on 4 by rolling the dice a lot of times.

It'd be interesting to try this where there's a common false signal, eg. if the question was to guess the numbers of bicycles in Beijing, and more people guessing would know the song than know whatever the actual number now is.

Expand full comment

Yes--I should have just said that instead of pretending I think it might be the spooky thing :). And I think the implication should be that any process that solicits feedback from many people and tries to incorporate all their perspectives will have advantages.

The common false signal point is really interesting. Similarly, something like "how deep is the ocean in leagues." This makes sense as a reason to, for instance, keep people off juries who have heard media accounts of the case, or to regulate the concentration of media companies.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

A long time ago I read a lawyer's observations on the jury system, and one point he addressed is why very smart people, strong-willed people, and people with powerful and related life experiences get removed from juries during voir dire.

He said it is not, as the cynics argue, because the lawyers want a bunch of emotionally immature and not very bright people that they can lead around to the desired conclusion -- but rather, as you observe, that what they wanted to avoid at all costs was a strong leader who would boss around the jury, so that you ended up with a jury opinion that was really the considered opinion of 1 person instead of 12. I thought that was a very interesting and probably sound point.

Expand full comment

This is really interesting!

Expand full comment
Feb 6, 2023·edited Feb 6, 2023

What happens if you compare people's second guesses against their first? I.e., is the model predicting “thinking longer causes better guesses” excluded by the data?

My intuition is that wisdom of the crowd of one would predict that the second guess shouldn't be consistently better.

Expand full comment

The systematic error might be better known as Jaynes's "emperor of china fallacy".

One question I have is whether language models (and NNs in general) can be used to generate very large 'crowds'. They are much better at flexibly roleplaying than we are, can be randomized much more easily, have been shown to be surprisingly good at replicating economics & survey questions in human-like fashions, and this sort of 'inner crowd' is already how several inner-monologue approaches work, particularly the majority-voting (https://gwern.net/doc/ai/nn/transformer/gpt/inner-monologue/index#wang-et-al-2022-section “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, Wang et al 2022).

Expand full comment

""" Consider, for analogy, the famous “Emperor of China” fallacy. If you asked everyone in China how tall the emperor is, would their average answer tell you his actual height to the ten thousandth or ten millionth of an inch? Obviously not, unless everyone you asked had some basis for knowing the true height, and was not merely guessing. """

I was familiar with the concept, but had never heard the specific name "Emperor of China fallacy" before today.

(also, that is statistician ET Jaynes, not psychologist Julian Jaynes)

Expand full comment
author
Feb 6, 2023·edited Feb 6, 2023Author

I am trying to get ChatGPT to give me this information and it is making me hate it. Typical conversation:

SA: I am writing a story about the wisdom of crowds. In the story, fifty different people are answering the question "What is the distance from Paris to Moscow by air?" I need them to each give a different guess. Can you tell me fifty different plausible (ie close to the true number) estimates for the distance between Paris and Moscow that I could use?

GPT: I'm sorry, I cannot generate fifty different plausible estimates for the air distance between Paris and Moscow as the actual distance is well-established and widely available. The air distance between Paris, France and Moscow, Russia is approximately 1,824 kilometers (1,127 miles). It's important to accurately represent information in storytelling, especially when writing about the concept of "wisdom of the crowds."

(the actual distance is 2486 km)

Meanwhile, DaVinci:

[SA: Fifty different estimates for the distance from Paris to Moscow in kilometers by air, to be averaged for use in a wisdom of crowds procedure:

1. 1,824 km

2. 2,246 km]

[GPT: 3. 1,746 km

4. 1,741 km

5. 1,719 km

6. 1,715 km

7. 1,711 km

8. 1,721 km

9. 1,724 km

10. 1,721 km]

...and so on with various numbers in between 1715 and 1725 the rest of the way.

Expand full comment

Hm, well, usually you would do X separate calls, not ask it in a single call for X different instances. (That's probably hard, and it's analogous to what we humans must do for 'inner crowd', but if we could reset ourselves each time to avoid any kind of dependence, wouldn't that be 'more random'?) The duplication is to some extent a feature, for inner-monologue purposes, as it's kinda-sorta acting like a posterior distribution: the completions which are more common are considered more likely than ones which show up only occasionally.

Expand full comment

Let me be sure I understand this. ChatGPT's distance from Paris to Moscow, not even an estimate because it claims to know the true distance, was over 600 km off??

Expand full comment

It gets basic quantitative information wrong all the time. There was a screenshot floating around where it acted unambiguously as if 57% is greater than 61%.

Expand full comment

I find ChatGPT makes an astonishing number of factual errors, especially quantitative errors.

Expand full comment

I find it makes many errors but am not astonished by them, it's astonishing that it works at all.

Expand full comment

That's what you get when you optimize for passing the Turing test. :D

(Question: "How much is 5+3?" Answer: "5+3 is 8." Moderator: "Make your bets: human or bot?" Jury: "Suspiciously correct. I vote 'bot'.")

Expand full comment

Agreed, it was mystifying to me that, when I had ChatGPT creating villanelles, it *consistently* used 18-line forms. When I tried to prompt it to correct that, that it was missing the last line, it agreed that villanelles should be 19 lines and redid the last stanza to have 4 lines, but horribly broke the refrain patter. When I tried to help it correct that (numerous times), it invariably would agree with me that it had made a mistake, and then add additional lines to correct it. It was really weird...I expected it to get the metaphors and music aspects wrong (it did okayish on those), but to perfectly reflect the poem's form, which is objective. That really surprised me.

Expand full comment

It’s not good at counting - I find it useful to imagine it like a person who never counts and only uses their intuitive number sense. From a big picture, this *looks* like the right number of lines for a villanelle.

Expand full comment

It's terrible at meter in my experience. Its sonnets don't scan, and it will insist it's using iambic pentameter when it isn't at all.

Which admittedly makes it more human. Most people I know when writing verse hew to the "it doesn't matter how many syllables there are in a line or where they're stressed, as long as there's some sort of rhyme at the end" school of poetry.

Expand full comment

It's bad at syllables generally. Here's a dialogue I had with it:

Kenny Easwaran:

What are some of the longest English words with only one syllable?

ChatGPT:

There are several one-syllable words in English that are quite long. Some examples include:

Screeched: a past tense of the verb "screech," which means to make a high-pitched, harsh sound.

Streaked: a past tense of the verb "streak," which means to move quickly in a straight line.

Strengths: a noun that refers to the quality or state of being strong.

Screeching: a present participle of the verb "screech," which means making a high-pitched, harsh sound.

Streaking: a present participle of the verb "streak," which means moving quickly in a straight line.

Strengthen: a verb that means to make or become stronger.

Strengthening: a present participle of the verb "strengthen," which means making or becoming stronger.

Screecher: a noun that refers to a person or animal that makes a high-pitched, harsh sound.

Streaker: a noun that refers to a person who moves quickly in a straight line, especially as a prank or protest.

Strengthener: a noun that refers to something that strengthens or has the power to strengthen.

These are just a few examples of long, one-syllable words in English. There are many others as well.

Kenny Easwaran

I think some of those words are more than one syllable!

ChatGPT:

You are correct. I apologize for the mistake in my previous response. Some of the words I listed, such as "strengths," "strengthening," and "strengthener," do have more than one syllable.

Here are some examples of long, one-syllable words in English:

Screamed: a past tense of the verb "scream," which means to make a loud, sharp sound with the voice.

Leaped: a past tense of the verb "leap," which means to jump or spring suddenly and vigorously.

Slept: a past tense of the verb "sleep," which means to rest in a state of unconsciousness.

Crept: a past tense of the verb "creep," which means to move slowly and cautiously, especially on hands and knees.

Kept: a past tense of the verb "keep," which means to hold or retain possession of.

Swept: a past tense of the verb "sweep," which means to clean or clear an area by sweeping.

Slept: a past tense of the verb "sleep," which means to rest in a state of unconsciousness.

Wept: a past tense of the verb "weep," which means to shed tears or show emotion through crying.

Heft: a verb that means to lift and hold something, especially in order to judge its weight.

Left: a verb that means to go or move to the side or away from a central point.

I hope these examples are more accurate and helpful. Let me know if you have any other questions.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

I got it to work with the prompt:

"So let's say there's 20 people, Person 1 through Person 20, who are playing a game. Each one gives a plausible guess for the straight-line distance in kilometers between Paris, France and Moscow, Russia. Whoever gets closest wins. The game goes like this:

Person 1: "

And its response (edited for brevity) was:

"2,000 km

Person 2: 2,500 km

Person 3: 3,000 km

Person 4: 2,800 km

[...]

Person 19: 2,800 km

Person 20: 2,795 km

The closest guess is from Person 20 with a distance of 2,795 km. They win the game!"

That it didn't select the correct person to win aside, the average of its 20 numbers was 2782 km, for an error of 296 km.

Expand full comment
Feb 6, 2023·edited Feb 6, 2023

I remember hearing about a paper that described a similar effect for the weights itself: if you take a neural network and train it for a while, after some time it ends up reaching a "valley" in the gradient, where its loss function converges to a value it can't get below.

Except it turns out this valley is more like a plain or a sea (imagine a very wide flat surface with hills around it). You can pick any "point" in that sea and the loss value will be the same (the "sea level"). So what the neural network ends up doing when you keep training it is that it "walks on the beach" of that sea: its weights will keep changing in way that keeps the loss at "sea level" while keeping it at the edge of the sea gradient-wise.

Now that is interesting in itself, but the more interesting part is that you can actually take a few checkpoints of the model as it's "circling" that sea, average these checkpoints, get a model that's in the "middle" of the sea; not only will this model still give coherent answers with the same average loss, the model will actually generalize *better* to out-of-distribution inputs.

So taking a network and averaging it with "itself plus a little more training" can actually give you better results than the training alone. Man, machine learning is weird sometimes.

Expand full comment

I use the single-player mode a lot when I'm guessing what something will cost - and I use it on my wife too. I start with two numbers, one obviously too low and one obviously too high. I then ask:

Would it cost more than [low}?

Would it cost less than [high]?

Would it cost more than [low+$10]?

Would it cost less than [high-$10]?

. . . . and so on. You know you're getting close when the hesitance becomes more thoughtful.

I'm sure I'm not the only one who does this, but I believe that many of us do something similar in a less deliberate or structured way. If you've lived in in Europe, you probably have a good feel for the scale of the place and of one country relative to the next. You may even have travelled from Paris to Moscow. If you live in North America, you may zoom out and rotate a globe of the Earth in your mind's eye until you reach Europe, and then do some kind of scaling. Estimating by either method will almost certainly give a better result than a WAG most of the time. So your "very wrong" answers weren't necessarily from lizardmen, but were just WAGs rather than thoughtful estimates.

Expand full comment

You may not be the only one who does this, but you're the only one who does this without their wife screaming "Just ask me the damn question!"

Expand full comment

So much likes for this comment. Seriously all the likes, come on man where's a damn like button, I need to mash it

Expand full comment

I do this all the time when forecasting. E.g. For Paris to Moscow rather than immediately ask myself how confident I am that the 3,000 km my brain generated first was too high/low, I start with the extremes to set bounds on my guess, then narrow my confidence interval.

I'll pick two other random cities and recreate:

London-Berlin

90% CI = 500km-3000km

80% = 700km-2200km

60% = 900km-1800km

40% = 1050km-1500km

20% = 1150km-1400km

Guess: 1250km

(Actual was 932km)

Hong Kong - Singapore

800km-6000km

1k-5k

2k-4k

3300km guess.

(Actual was 2581 km)

Expand full comment

Or from people who are very, very bad at geography.

I certainly didn't get 40,000 km, and I only vaguely remember what I did get, but I know I used a "one time zone is about X miles" factoid from a conversation the previous week, then decided Eurasia was probably a bit more than a quarter of the globe, there are 24 time zones, do the math (and convert to km). I don't remember what I guessed precisely but I'm pretty sure it was wildly high. Either I misguessed what part of Russia Moscow was in (perfectly plausible) or my X number was for closer-to-the-equator than we were talking, or something else went wrong in the estimate - but I suspect there may be a bit of typical minding here from someone who can't believe anyone could be That Bad at geography, and nope, nope, someone totally could.

Now, I knew the number was probably wildly off, would never have relied on it for anything, and in real life would not even have bothered to generate it. I know I'm very bad at geography (partly no doubt because I'm very disinterested), and if this kind of thing comes up in real life I look it up or ask someone. But I figured taking guesses even from people who were very bad at it was part of the study design, so...

Expand full comment

I think the problem with this approach is anchoring. You've anchored starting at the first number, which was intentionally low. Now anything will sound high compared to that.

A friend was once trying to establish how much I could sell something for, and asked a third party, "Would you buy this for £5?" "Definitely." "Would you buy it for £10?" "Sure." "£15?" "Yeah, probably." "£20?" "I dunno. I guess." "£25?" "Maybe?" (It's something that would typically be in shops for £40-£50.)

Expand full comment
Feb 6, 2023·edited Feb 6, 2023

Post gets only a 7/10 enjoyment factor, I still don't know how far apart Paris and Moscow are in surface kilometres and am now forced to have to go look it up. Upon reflection that my personal enjoyment might have been wrong, I've revised my estimate to 5/10 and have now averaged this out to 6/10...or was it ....the square root of 5*7 or 35^(1/2) for an enjoyment of 5.92/10? I don't even know anymore!

Expand full comment

I took the instruction to assume that I was off by a significant amount seriously. I decided i thought i was more likely to be greatly underestimating than over estimating and so took my first estimate and x10. In other words, i really didn’t re-estimate from scratch at all. If this analysis was your intention all along, perhaps explaining your intentions would have gotten people to rethink it in a more straight forward way.

Expand full comment

I too took the instruction seriously. My second guess basically used the procedure "_given_ that I was super wrong, is it a lot more likely that I was super low or super high? ==> guess x2 or /2"

Expand full comment

This is both a great example for and a horrible case of the "wisdom of crowds" fallacy in forecasting - the problem isn't that your guessing at something known to a large part of the population approximately and so a larger sample more reliably gives you a median that is close to the ideal median of the entire population, which will be somewhere in the vicinity of the real thing because there is some decent penetration of the real value into the populace.

In forecasting you're guessing at something that isn't known to a large amount of the population, but the population and ergo your sample will have some basic superstitions on the issue that come mostly from mass media and social media and so even when you get a good measurement of the median, the prediction is still crap because you polled yourself an accurate representation of the superstition and not the real thing.

Say you want to know when Putin will end the Ukraine war - only Putin and a few select individuals know when that will be - if at all and this isn't made up on the go. But everybody will have some wild guesstimate, since newsperson A or blogger B or socialite Z (pun intended) posted some random ass-pull on twitter not necessarily claiming but certainly implying to know when it will happen. This is the result you're gonna get in your poll.

Wisdom of crowds is useless as forecasting and only works when the superstition has some bearing on the issue at hand, i.e. the policy itself is influential on public opinion or there is a strong feedback loop which ensures conformity of what's happening with the emotional state of "the masses". That, mostly, doesn't appear to be the case.

Expand full comment
author

I don't think it's that simple - basically nobody knows the real distance from Paris to Moscow, yet a crowd of 100 got within 8%. Nobody knows for sure what will happen in the future, but aggregates of forecasters (including superforecasters) consistently beat individuals.

I think of the Paris-Moscow problem as - everyone is going to have little tricks or analogies or heuristics they use to try to solve it (mine is "it feels like about half the distance across the US, and I know that's that's 5000 km). Those tricks contains signal and noise, the noise averages out over many people, and all that's left is the signal (unless there's some systematic reason for people to be wrong, eg map distortion).

I think this is the same with forecasting the future. Remember, people weren't being asked to guess whether Russia would invade Ukraine, they were being asked to give *their percent chance that would happen*. I think there is a true probability for that which a perfect reasoner would conclude using all publicly available evidence at the time (in the same way that the true probability for getting a 1 when you roll a dice is 16.6%, even though nobody knows how the dice will land). I think people's guesses are something like the true perfect-reasoner probability plus random error, and should average out to the true perfect-reasoner probability, unless there's systematic distortion. Here there might be - for example, propaganda, or status quo bias, or war being too terrifying to contemplate. But I would still expect the noise-removed version to be better than the noisy one.

Expand full comment

"nobody knows"? I knew roughly ( my guess was actually 1800km). And that's because 30 years ago i flew to Paris. once! And i didn't even need to remember that number and at that time they didn't have lcd displays everywhere in the cabin showing distance and time left.

There are a lot of people who fly regularly. many of them know their flight distances

Wisdom of crowds fallacy is same as in anecdote about dinosaur ( chance of meeting dinosaur is 50%. -to meet or not to meet). It ignores how much priors matters . Fact is priors matter more than anything else. In fact you might not need statistics at all, only good priors

Expand full comment
author

You were off by about 700 km, approximately the width of Germany. I'm not sure why you would call that "knowing roughly". I stick to my claim that this is something most people don't know, although they may have guesses, heuristics, and analogies.

Expand full comment

Dunno, I flew Moscow-Germany several times, been to Paris, too. I only once checked my flight distance - as it was overbooked and I needed to see which compensation-bracket I was in - and that was an Ukraine-Germany flight. Anyway my guess was at 3500km - and when I was asked to reconsider I was assuming it could only be considerably further out. Whatever, I doubt the wisdom of one-crow (sic) + conclude, this experiment did not validate the theory + I am excited to see a mainly US-crowd doing so well.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

Yeah but it seems very likely the accuracy of group estimates only beats experts when the outcome is quotidian, and this is not very interesting in terms of policy planning. What you want is something that predicts Black Swans better than experts, or random guessing, because those are the really important things for improving policy planning.

Id est, a prediction market that converged in 2015 on the conclusion that the date of a scary new pandemic out of China (a Black Swan) would be 2020 +/- 1 year would've been actually useful to planners. A prediction market that converges on 2025 +/- 1 as the year when COVID restrictions are largely gone is not very useful. That they will be gone within a few years is conventional wisdom already, and the precision of a crowdsourced estimate on the exact date is not likely to be that much greater than a random guess by an expert to be worth the cost of switching methodologies.

Expand full comment

>"...socialite Z (pun intended)..."

I feel like you're doing an inverted form of https://xkcd.com/559/ with this.

Expand full comment

This is something that I've been thinking about in the context of LLMs. Ask an LLM a question once, and you are sampling its distribution. Ask it the question 10 times, consider the mean and variance, and you have a much better sense of the LLM's actual state of knowledge.

Here is an LMTK script I wrote in Jan which demonstrates this in the context of math problems: https://github.com/veered/lmtk/blob/main/examples/scripts/math_problem.md

Expand full comment

I guess Walt Whitman was on to something when he wrote "I contain multitudes"!

Expand full comment

Is the data from the study saying that the average guess was many times larger that the actual answer? It seems that that might part of the reason why you got different error measurements. Guessing geographical distances has a limit on upper bounds in a way that guessing a number of objects doesn't.

Expand full comment
author

I think yes, the arithmetic average was way too big, because some people guessed a million or a billion, and it takes a lot of people guessing much-too-low things to cancel out.

In fact there's no limit on upper bounds for either - people gave estimates for the Paris-Moscow distance much much bigger than the circumference of the Earth. This is why I had to use geometric mean.

Expand full comment

That makes sense, since the error for most people should be roughly log-normal; this is just the central-limit theorem. (If you assume everyone's guesses are off because of many small errors adding together, and each error causes people to be off by a certain %). The geometric mean happens to be the sufficient statistic for the log-normal distribution.

Some of these sound like trolls, though, in which case the most extreme answers should be downweighted. The harmonic mean might work out to be a better estimator here.

(I assume you had to discard zeros.)

Expand full comment

Harmonic mean - ugh no. More obfuscation. Throwing math at the problem will not explain what is going on.

Expand full comment

Harmonic mean is for averaging rates; I don't think it applies here.

Expand full comment

This is perfect reason to look at the standard deviation and to use a histogram (to identify the wildly wrong guesses in order to determine whether there are outliers and whether there is a good reason to remove them.)

Fooling around with geometric mean should not be your first instinct.

Your entire approach to the problem seems off. Statistically and anthropologically.

Expand full comment

The geometric mean has good theoretical backing here, and in fact is the most principled way to take an average of many estimated distances, as estimated distances are going to follow an approximately log-normal distribution.

Looking at histograms and then randomly chopping off data you think is “Bad” is the approach that’s statistically off. It’s unprincipled and extremely open to ad-hoc manipulation.

Expand full comment

I didn't say randomly chop off outliers.

I carefully (or so I intended) wrote: to determine "whether there is a good reason to remove them." Whether and good reason being the key question.

We will have to disagree about jumping to geometric mean.

Expand full comment

Also, the standard deviation is not going to make any sense here. The standard deviation is *far* less robust to outliers than the sample mean.

Expand full comment

What?

Expand full comment

The standard deviation depends on higher-order moments (specifically, the square of the data), and as a result it’s much more unstable in the presence of outliers.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

No kidding.

1. You can't really detect outliers without looking at SD.

IF actual "outliers" don't exist then data is what it is.

IF actual "outliers" do exist, then on a non ad hoc basis we must decide whether to exclude or include the outliers.

2. The actual premise of WCT was to use the median. Galton notwithstanding racist instincts also had a majoritarian instinct.

What was the essay really about? And why does the essay fail as a serious inquiry.

Is WTC generally true or not? That requires a careful operational definition of what exactly do we mean by WTC? Why doesn't it seem to work here? - not how can we make it work by using some other ways to take a mean. That is using math as a kind of voodoo and there are plenty of people who think they have a "system" at the keno machine using math.

The second question proposed for investigation is can we bootstrap our own mind to create the conditions where WTC might seem to work?

But the premise, as I understand it, of WTC is the presence of a diversity of actors and talents and approaches. Unless one is arguing for a purposeful fragmentation of the mind, the proposition is very odd.

You can become a better guesser by understanding where you might have made errors. But that kind of self correction is not the premise of WTC.

Expand full comment

Yes, the AM was too large:

- The true answer is 2,486km.

- The arithmetic mean of all estimates is very bad. For the first estimates the AM is 7088km, for the second estimates it is 9331km, and for first and second estimates together it is 8210.

- The geometric mean of all answers is pretty good. For the first estimates the GM is 2,722, for the second estimates it is 2961, for first+second it is 2,839. That is only 9% / 19% / 14% from the truth.

Expand full comment

Doesn't Caplan's Myth of the Rational Voter deal with how the wisdom of the crowds only works when people aren't systematically biased on the subject in question?

Expand full comment

For those who were (like me) confused by what "geometric_mean[absolute_value(geometric_mean<$ANSWERX, $ANSWERY> - 2487)]" is supposed to mean, here's the ChatGPT explanation which makes sense:

This expression calculates the geometric mean of the absolute value of the difference between the geometric mean of two values ($ANSWERX, $ANSWERY) and 2487.

The geometric mean of two values is calculated by multiplying the two values and taking the square root of the result. So the expression "geometric_mean<$ANSWERX, $ANSWERY>" calculates the geometric mean of the two values.

The difference between this geometric mean and 2487 is then taken, and the absolute value of this difference is calculated, ensuring that the result is always positive.

Finally, the geometric mean of this absolute value is calculated, which gives a single value as the final result.

Expand full comment

> So is the percent chance that your country would win. If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?

Aren't people doing that all the time ? Governmental organizations have committees; large corporations have teams; some of them even hire smaller companies as contractors specifically to answer these types of questions.

Expand full comment

re: "What about larger crowds? I found that the crowd of all respondents, ie a 6924 person crowd, got higher error than the 100 person crowd (243 km). This doesn’t seem right to me..."

I'm also suspicious, and would predict this observation will reverse with enough resamples of the 100-person subsets. 6,000 instead of 60 would probably do it? Or maybe some outlier was simply missed.

There is likely a simple proof based on sum-of-squares decompositions that would show the average over all subsets of the 100-person group has higher error

Expand full comment

This looks like much ado about nothing.

Part of the crowd guesses high, the other part guesses low.

Nobody is right, but it averages out closer to right.

'Nuff said.

Expand full comment

The important point is that the extent to which part of the crowd guesses high and the extent to which the other part guesses low averages out to approximately the correct number.

Suppose I throw a 6-sided die and ask a crowd to guess which number I got. If we average their answer we're probably going to get something close to 3.5 *regardless of which number I actually got*. So in this case there is no wisdom to be gained by asking a crowd.

The fact that a crowd does converge on something close to the real answer when there is a real answer *is* impressive.

Expand full comment

Yes and no.

Again, I repeat that it is probable if all the "wisdom of crowds" is averaging out the outcomes - what it really indicates is that humans are pretty good at roughly estimating amounts of physical things.

A better test would be something which does not involve physical sums or this inherent capability. The counterpoint to this is the "man on the street" interviews on all sorts of non-everyday subjects like: How far is the moon from Earth?

I guarantee there would be zero "wisdom" in such an example.

Expand full comment

I think the spookiness of "inner crowds" improving your answers mostly comes from an intuition that whatever you were doing originally can be approximated as being an ideal reasoner. An ideal reasoner shouldn't be able to improve their answers by making multiple guesses.

But humans are often pretty far from being ideal reasoners. If this works, I see that more as an indictment of how bad humans are at numerical estimates, rather than a spooky oracle.

(Though this doesn't prevent it from being useful...)

Expand full comment

I wonder if the "inner crowd" effect is anything more than a way to get people to spend more time and effort thinking about the question. On the other hand in the survey people's second guesses were off by a little more than the first ones?

Expand full comment

My best guess is that it has something to do with counteracting anchoring bias (the tendency to stick too close to whatever number first pops into your head). The finding that inner crowds work better with a longer delay between guesses could be because it helps you forget your previous anchor.

Expand full comment

One hypothetical mechanism for why it works is that it forces the forecaster to make an estimate of their uncertainty and take a second draw from the implied distribution. It’s similar to when someone wants to “sleep on it”, even though they aren’t going to get any new information. They are just going to think about the worst case (and maybe best case) and get a second draw after thinking more about the distribution of results

Expand full comment

> I think the answer is something like: you can only use wisdom of crowds on numerical estimates, very few people (currently) make those decisions numerically, and the cost of making those decisions numerically is higher (for most people) than the benefit of using wisdom of crowds on them.

Actually, I think you're wrong on this one: wisdom of the crowds really is OP and we're severely under-using it.

An example that immediately comes to mind is peer programming: by having two people work on the same code simultaneously, you can immensely increase their productivity. Every time I've tried it, I had positive results, and yet most companies are *very* hostile to the idea.

The part about getting diminishing returns as you add more people is interesting too. I wonder if you could drastically reduce design-by-committee problems in an organizations by making sure all committees involved have at most three or four people in them.

Expand full comment

Maybe a non-spooky explanation is that when we do not know the exact answer to a question, we instead have a distribution of possible answers. When you force someone to collapse that wave function down to a single scaler measurement, they will randomly pick one possible answer as per the probably distribution. But you have lost all the rest of the information contained in the distribution. When you ask again, you make a 2nd sampling from the distribution, which adds precision. Note that if you keep asking you will get more points, but inevitably you still loose information.

Example, I might know that there is either $100 or $200 in my bank account becuase I don't know if a check cleared yet. If you force me to pick a single value I'll pick either at random. Ask me twice and 50% chance I'll pick the other. By your way of measuring it looks like I don't know much, which in fact I have complete information less a single bit.

Similar argument works for crowds as well.

Expand full comment

I also made an analysis of the inner crowd on the same survey question, using different statistics.

https://astralcodexten.substack.com/p/acx-survey-results-2022/comment/12089011

tl;dr: I got similar results as Scott: the inner crowd helps a bit, but not too much. Strangely, the second estimate was much worse than the first. Some speculated that this was due to Scott's phrasing "off by a non-trivial amount" in the second question, but the same effect (worse second estimate) was also in the literature, where probably they didn't have such a phrasing. (But my source was much less sophisticated than Scott's VD and VDA paper.)

Highlight numbers, GM stands for "geometric mean":

- The first estimate was off by a factor 1.815. (This means that the GM of all those factors was 1.815)

- The second estimate was off by a factor 1.901.

- The GM of the two estimates was off by a factor 1.791.

- How often was the first estimate better than the second: in 53.3% of the cases.

- How often was the GM better than the first estimate: in 52.8% of the cases.

- How often was the GM better than the second estimate: in 60.0% of the cases.

Expand full comment
author

Thanks, I forgot to link your analysis in the post but I'll put it on an Open Thread. Do you understand why we got different results on geometric mean? Maybe it's just the full vs. public dataset?

Expand full comment

You mean if we just compute the geometric mean of all estimates? I got 2,839, which is 14% off. You don't write yours, but it was only 8% off, right? I got almost the same when I only used answers to the first question, estimate 2,722, which is 8.7% off. Perhaps that is what you used as well?

If that's not it, then my only guess is that it's the different datasets. If any of us would have made a programming mistake, we would hardly get so good results. I only took participants which answered both questions, perhaps that shifted the distribution a bit?

Apart from that we computed slightly different things. For the mean deviation, you computed:

geometric_mean[absolute_value($ANSWER - 2487)], for all 6924 answers. This gives 918km, so 37% deviation.

I took the factors by which the answers deviate from the truth, and computed the GM of those:

geometric_mean[IF ($ANSWER > 2487) THEN ($ANSWER/2487) ELSE (2487/$ANSWER)], for all answers.

This gives a factor of 1.79, or 79% deviation from the truth.

Another way to calculate the same number, which is perhaps easier to understand: switch to log scale, compute the the difference from the truth, compute the AM, and then convert back (the base of the log does not matter, as long as it is consistent):

$LOGANSWER = Log_10($ANSWER)

$DIFFERENCE = absolute_value($LOGANSWER - Log_10(2487))

$AVERAGE = arithmetic_mean[$DIFFERENCE], for all answers

$FACTOR_OFF = 10^$AVERAGE_LOG

I think my approach is a bit more principled, because you mix an arithmetic operation (minus) with a geometric one (products from the GM). I only take products/quotients, or sum/difference in log-space, which is the same type of operation. In any case, it's not surprising that we got different results there.

Expand full comment

I agree that mixing geometric means and arithmetic distances (like absolute_value($ANSWER - 2487)) is possibly problematic. In particular, in the formula

geometric_mean[absolute_value($ANSWER - 2487)], for all 6924 answers

if just one participant guessed the correct answer 2487, then one of the factors in the geometric mean gives zero, so the entire geometric mean computes as zero. On the other hand, in the formula

geometric_mean[IF ($ANSWER > 2487) THEN ($ANSWER/2487) ELSE (2487/$ANSWER)], for all answers

if one participant guesses correctly, it just means one of the factors in the geometric mean is 1, and no such degenerate result occurs.

Expand full comment

In particular, 2500 was a nice round guess, and it's really close to the actual answer (giving a difference of 13 km). With geometric means, that might throw off things disproportionately.

Expand full comment

When asked to guess a number, my mental process is to first find a range, then pick (somewhat arbitrarily, honestly) within that range. I suspect that repeatedly sampling the same person is just a rough, inefficient way to find their range estimate.

I suggest trying a similar question but asking for the 70th percentile upper and lower bound on the distance (with another question asking if the person knows what that means as a filter).

Expand full comment

What if you tried bootstrapping the larger groups of individuals (i.e. sample with replacement)? I’m on vacation or I’d do it myself but I’d be curious on if that improves the error

Expand full comment

What do you mean "improves the error?" Bootstrapping doesn't help you estimate more accurately; the amount of data you have is fixed. It just lets you estimate confidence intervals on the error. The average of the bootstrapped samples will equal the sample average for the whole population.

Expand full comment

Ahh ok, got it. That’s exactly what I was asking

Expand full comment

I think you raise a really interesting question.

Expand full comment

This suggested to me that the 'internal crowd' was almost entirely worthless. "P < 0.001!" Yes, but magnitude <2% improvement? I have low confidence in a result like this one (even with a great p-value!) that purports to demonstrate a method for 1.5% improvement in guessing accuracy.

Expand full comment
author

See the Nature paper, where on their task a sufficiently large inner crowd can approximately halve error.

Expand full comment

Why do you think you weren't able to replicate their result? (I wouldn't count going from 2x to 1.5%, no matter the p-value, a 'replication').

Expand full comment

IIRC the reasoning for *why* the (outer) wisdom of crowds works, is that the crowd contains a few experts who will be biased in favor of the correct answer... while everyone else errs randomly above or below the correct answer. So there was no inner wisdom of crowds in this version.

Expand full comment
author

I don't think so. The Nature task used an "estimate the number of rocks in a giant bottle" type task. There aren't "experts" in the same sense where a cartographer might simply know the right answer to the Moscow question.

Also, given that experts form a consistent proportion, I don't think in this situation you would expect to see accuracy improve as a function of crowd size. If 10% of people are experts and 90% nonexperts, you should see the same balance in crowd size of 10 (1 expert, 9 nons) or 100 (10 experts, 90 nons). It won't be these exact numbers, but you won't in expectation see experts as a percent increase with increasing size.

Expand full comment

“Estimate the number of balls in this jar” and “Estimate the distance between Paris and Moscow” seem like qualitatively very different tasks to me.

Estimating the balls in the jar seems like a visual reasoning task, whereas estimating the distance seems like a preexisting knowledge task.

I didn’t know where Moscow is within Russia. I didn’t know how many countries were between France and Russia. I didn’t remember whether a kilometer was bigger or smaller than a mile. And I didn’t know any reference large distances to use for comparison except that the radius of the earth is 4000 mi. Therefore there were so many inferential steps in my distance guesses wherein to introduce additional error; as compared to my guess about balls in a jar, which seems to just be testing my skill at 1 thing.

Expand full comment
author

I think if I were to try to estimate the ball jar task, I would use some method like "it looks like there are about 100 balls on the top level, and the jar is about 50 balls deep, so 5000 balls", which doesn't seem that different to me from "looking at my internal world map, it looks like Paris and Moscow are about 10% of the Earth's circumference away, and I expect the Earth's circumference at that latitude to be about 20000 km."

Expand full comment

I don't see where you disagree. The quality of your internal map, the knowledge about Earth's size etc. all rely on outside information.

Expand full comment

I remember unfortunately ruining my results for this by immediately looking up the answer after putting in my guess for the first question (since I didn't know there was going to be a second).

Expand full comment

I did the same thing.

Expand full comment

Hi Scott; it's the inverse-square-root. The standard error of an estimate declines as a function of 1 / sqrt(n) for sample size n (because the variance declines with 1/n).

If the estimates are biased, the root-mean-square error is going to be sqrt(bias^2 + (variance / n)) for sample size n, i.e. the mean squared error will decline hyperbolically. This isn't something the study found; it's a mathematically-derived formula, which they then fit to the data to get estimates for bias^2 and variance. Because estimates taken from 1 person are going to be substantially biased, the error will never reach 0; it asymptotes out very quickly. The average of many people is going to be much less biased, such that the variance probably dominates.

Expand full comment

It seems moderately hard to *in general* defend the assumption that there's no fundamental minimum to the confidence (95% CI) that you should have in an estimate. It's more defensible as a mathematical result, but at some point it's false confidence.

For example, imagine that you ask a crowd to estimate the sum of 10 dice. If they don't suck at forecasting, you'll get a distribution centered around 35, but the amount of confidence that it's between, say, 33 and 37 should be limited, regardless of crowd size. So RMSE doesn't continue to decline, but your estimate won't reflect that. (Obviously, in this case you could fix it by individuals estimating the distribution, or similar - but when we estimate a quantity, we don't do that, and there are times that the confidence should be limited.)

Expand full comment

In traditional statistics, what's being estimated are parameters of statistical models -- the population mean, in this case -- and we can have tight confidence intervals about means without making assumptions about the associated variances.

In your die roll example, the quantity being estimated is the mean of the random variable that is the sum of rolls, which we can indeed have tight CI for. It's possible I'm reacting too literally to your example, and that there are better examples of the phenomenon you're interested in i haven't considered.

Expand full comment

"In traditional statistics, what [we are justified in trying to have] estimated are parameters of statistical models." FTFY!

And I agree that you can estimate means without making assumptions about variances, but you can also aggregate multiple elicited distributions, which is what we really would want - it's just that you don't necessarily get tightening CIs, and to do it properly, complex Bayesian issues arise.

Expand full comment

> The average of many people is going to be much less biased, such that the variance probably dominates.

It's not clear to me that this is true. Many people have read the same wrong articles on quantum computers (or whatever), thus their bias will be highly correlated.

Expand full comment

I calculated the best fit curve on n=1 to n=100 and got bias = 157, variance = 975,000. From eyeballing the residuals, it looks like a pretty good but not great fit:

- It overestimates the error for n=1 by 104

- It underestimates the error from n=2 to n=29 by a minimum of -1.4 and a maximum of -21

- It crosses zero twice more from n=30 to n=32

- It overestimates the error from n=33 to n=100 by a minimum of 0.50 and a maximum of 5.0

Expand full comment

I probably produced two of the very far outliers because of being very bad at geography and spatial reasoning generally. I think I put down a guess that was an order of magnitude wrong, and then, being told by the second question to answer as though my first was wrong, changed my answer by an order of magnitude in the wrong direction. I don't if this information is helpful to anybody; but some of us don't realize we're being lizardmen because we have no idea how to meaningfully connect the ideas "kilometer" "Paris" and "Moscow". 1,000 km seems as reasonable to me as 200,000 km.

Expand full comment

I also have poor intuition about these kinds of questions so maybe you'll find my reasoning helpful. I knew that meter used to be defined as 1/40,000 of earth circumference (well, technically 1/20,000 of a certain meridian, whatever). And I also remembered that Moscow is 1 hour ahead of Kyiv*, which is 1 hour ahead of most of EU. So a rough estimate would be 2/24 of 40,000km, or about 3,333 km. That's within a factor of 2 of the correct answer.

[*] Actually this is only true in winter. Oops.

Expand full comment

Nit: it's the km that was 1/40,000th of the polar circumference.

Expand full comment

Oops. You're right of course.

Expand full comment

I'd be curious to see if those with dissociative identity disorder (or those who self-identify as systems, since that's probably more common than an official diagnosis) are better than the rest of us at this internal wisdom of the crowds.

Expand full comment

proposal for improvement:

- right before asking the first time, ask people to provide the last 3 digits of their zip code, or any other essentially random number

- preface the second question with an explanation of anchoring and ask people to provide a new estimate without referring to their previous one.

Benefits:

- providing a plausible reason for people to give a new estimate without inducing too much distortion

- measuring how much anchoring affects ACX readers

- measuring how much "inner crowd" can counteract anchoring.

Expand full comment

This hits on why I don’t see fast AI takeoff being a thing. GPT is wisdom of the crowds. A bunch of text is averaged together and gets you an answer that is directionally correct (as far as text completion goes) but is only going to asymptotically approach reality.

To “know” facts you need a different methodology, that is essentially brute force. How do you know the distance? You looked it up from a reputable source, which is reputable thanks to a reputation that took thousands to million of person hours to cultivate, and on top of that someone had to actually physically go and measure (or just wait until we launch satellites that account for general relativity into space and compute it from their data.)

Wisdom of the crowd works because it is actually very very hard to obtain real knowledge, but we think it is easy because we have a superficial experience of “knowing” many different things. Averaging a bunch of estimates allows more real knowledge to contribute.

All this gives me a low prior on AI takeoff even being a thing. We will burn out on modelling existing human knowledge and then begin the hard work of developing machines that can do the hard and painstaking work of actually gaining new knowledge. It will not be fast because knowing things is really a lot of work. Those 10^46 simulated humans will probably get bored and want to do something easier.

Expand full comment
author

I agree there's a chance of AI plateauing near human level, but I think it's less clear than you think.

AI already predicts text (eg guesses the next word in a sentence) much better than any human can. We don't notice this because we think a language model's job is "saying useful, coherent-sounding things", but this is not what language models are trying to do. They are trying to predict text.

It's not clear how to map this text-prediction task to the tasks we care about, but it's not obvious that the stuff we care about should plateau at exactly the human level.

Even if it did, which human level? An AI as smart as the smartest geniuses, but able to run 1000x faster and on as many instances as it wanted, would probably be smart enough for something takeoff-like.

Expand full comment

I fully expect AI to get more done than humans in the future, and even above our level. I am more faced with how difficult it is to learn even a basic fact like the distance between two cities.

Even knowing the number, you and I don’t actually “know” it in the same way someone who had walked the distance would know it. What use is just the number to us?

AI that actually knows things in that sense will have a lot of work to do.

It’s not clear to me that an AI can think 1000s of times faster. Our mental hardware is slower but also faster than a modern computer. Our compute is embedded adjacent to storage and is locally pretty fast with the synthesis being the slow part. Do that with modern hardware and it will slow down dramatically, and we don’t have a scale much lower than nm for much faster hardware to make up the difference.

Even supposing we do, you haven’t changed what is easily computable. P probably still isn’t NP. Brute force will quickly hit walls in increasing intelligence.

Sure, there could be a breakthrough in any of those things that changes everything. There could be warp drive. The ML development of the last decade does not look anything like warp drive, it’s just the SpaceX of ML (still very cool, but not something we’ve ever doubted, on a technological level, that computers could do.)

Expand full comment

How would the AI plateau at the level of human genius if it's being trained on the Internet? You could certainly train it to reproduce only the output of human geniuses, but then you might as well ask your panel of geniuses what they think, since you've gone to the trouble to identify and recruit them.

Expand full comment

Only now do I actually look up the distance from Paris to Moscow, and holy cow I was almost right on the money. My first guess was 2500 km

Expand full comment

My initial impulse is to ask for control! What happens if you pick a random number in a given range to guess (say, for Paris to Moscow the range would be something like 50 to 50,000 km, and yes I know that no two points on earth's surface are separated by more than 20,000 km, but some of your readers might now know it), then take a random distribution on the log scale, then pick two random samples? Would the "wisdom of crowds" effect be random chance?

Expand full comment

I've also found the "wisdom of the random duo" effect in my research (https://braff.co/advice/f/forecasting-masterclass-7-find-the-martha-to-your-snoop). I wonder if you or I could simulate the inner crowd by looking at forecasts on props that are highly correlated within the same contest? You have a bunch of Ukraine props where the average-across-3-props for a given forecaster may be a more accurate read on the whole battle than any one forecast?

Expand full comment

I also have poor intuition about this problem. However, when I got to the second question about the distance from Paris to Moscow, essentially asking me if I wanted to change my first guess, Monte Hall came immediately to mind.

Is there a logical comparison between this and the Monte Hall problem? Did anyone else think this? Should I look up some old Marilyn vos Savant posts?

Expand full comment

The obvious difference is that in the Monte Hall problem, the moderator gives the candidate actual information. In the Moscow question, the additional information is fake (we're supposed to assume we're wrong, even if we aren't).

Of course, in a real application where the actual solution is not known, giving meaningful feedback is tricky. You could elicit a second estimate from candidates if their first estimate is way off what everyone else is saying, but would that run the risk of converging to a wrong consensus based on what people guessed in the first round? I don't know.

Expand full comment

True. The similarity in the form of the problems struck me: 1) make your first guess, 2) get presented with additional information (right, wrong, other), 3) refine your guess. I think in both cases, the refined guess was often an improvement as long as the second guess was different from the first.

Expand full comment

If people's first answer was generally closer than their second answer, then means that it'd probably be best to take a weighted average that puts more weight on the first answer than the second.

Expand full comment

In the public data, I count 6302 people who made two different guesses (not equal to each other), and didn't get it right on the first guess.

Of the 2627 whose first guess was too small, 69% updated upwards on their second guess.

Of the 3750 whose first guess was too large, 51% updated upwards on their second guess.

So that's a 57% success rate at the implicit question "was your original guess too small or too large?"

Even though there was a tendency for people's first guess to be too big (59%), and a tendency to update upwards (58%), people still did better than chance at updating in the right direction.

(There were also 2 people whose first guess was exactly correct, and 75 who made the same guess twice.)

If I limit this all to people whose first guess was within a factor of 2 of the correct answer (above 1243 and below 4972), that leaves 4003 people.

67% of those whose first guess was too small updated upwards.

53% of those whose first guess was too large updated upwards.

On the whole, 55% updated in the right direction.

So, similar results even if limited to people who were in the right ballpark (within a factor of two).

Expand full comment

Some examples of how I use wisdom of crowds:

In games, like Codenames or Wavelength, people on my team independently come up with their guesses before we share and discuss them with each other.

In forecasting, I consider what range of forecasts I might plausibly make and average them. I also make multiple forecasts using different methods (e.g. using two different relevant reference classes) and then average them, to make use of information from independent sources. I also consult others' forecasts on the question when available to aggregate their views.

In general, when a group is collaboratively seeking the truth on a topic or trying to make a decision, I encourage giving everyone time to think of their own independent impression before having individuals share their view.

Expand full comment

Not sure what an "inner crowd" is.

Expand full comment

Just came here to say when I answered the distance question in the survey, I was SO off. I had no concept of the size of the earth so no idea what a reasonable distance would be. I can't remember now in which direction,but I was off by a whole order of magnitude. So yeah, probably one of the outliers. Just to put it out there that we're not all lizardmen, some of us just don't have a good model of these distances.

Expand full comment

I find this really bizarre. I thought the basis for the wisdom of crowds was Condorcet's Jury Theorem: assume (Independence) that individual voters have independent probabilities of voting for the correct alternative. Also assume (Competence) that these probabilities exceed ½ for each voter. It follows that as the size of the group of voters increases, the probability of a correct majority increases and tends to one (infallibility) in the limit. Suppose the number of voters = 1. While the single voter could make multiple guesses, how would that not violate the independence condition?

Expand full comment

Is it possible that since most of your readers are American, they had some idea in miles, and many just gave that same guess in km due to unfamiliarity with the conversion? The mean guess in km and changing the units to miles would be a lot closer to the true answer.

Expand full comment

Wisdom of the crowds is like ensemble learning for humans. Or maybe ensemble learning is wisdom of the crowds for machine learning models.

Expand full comment

Thinking of the mechanism behind the "crowd of one" effect. At first I thought it's a variant of the Monty Hall effect - first guess under complete uncertainty, second guess somewhere else in the spectrum, with some uncertainty removed. But more likely it is, combining sources of incomplete information. People will have different hypotheses or heuristics in mind to make a guess. They will only use one heuristic for the first guess. They will use a different one for the second guess. So now there is more information present than with a single guess. Example, if a person is completely uncertain about Paris-Moscow, they first might use the heuristic of "Russia is huge", then the heuristic "but Europe is small". The average of both biases produces a better result

Expand full comment

I've most recently used this trick to estimate how much wine I need for an event with a few dozen people.

I came up with an estimate using different mental model:

- One where everyone is thirsty and drinking wine (upper bound)

- One where people are hardly drinking any wine (lower bound)

- 2-3 more best guesses using different formulas

The numbers came out as: Upper > best guesses > lower.

So I felt pretty confident about how many cases of wine to order by averaging the best guesses and then adding some.

Expand full comment

>>> What about in finance, where people often make numerical estimates (eg what a stock will be worth a year from now)? Maybe they have advanced models calculating that, and averaging their advanced models with worse models or people’s vague impressions would be worse than just trusting their most advanced model, in a way that’s not true of an individual trusting their first best guess?

In fact this is standard practice in finance and most other ML applications, see https://en.wikipedia.org/wiki/Ensemble_learning , and is known to be one of the few methods systematically resulting in better predictions (another is increasing the dataset size). Multiple different models are typically created using different sources of information, underlying architectures, training techniques, etc, which are then "averaged" to make the final predictions. The models are usually as advanced as possible (i.e. they are a crowd of experts), and the averaging is typically also learned (i.e. instead of choosing between arithmetic and geometric means, you would learn the actual ensambling function to better account for each of the model's biases, ideally making use of their self-reported uncertainty). I doubt there's any big financial trading firm that does not have this in place, including the presence of multiple uncomunicated teams working on various models for the same purpose, each of them without access to the other models or final ensamble.

Expand full comment

I'll add to this that big hedge funds can have different teams covering different aspects of a stock - e.g. you might have a team covering tech stocks and a team covering large cap stocks, and they'll both send signals about Apple to the central aggregation team, which will put them through some algorithm to join them up.

Expand full comment

I have heard 'Wisdom of the crowds' described very differently, when you get large groups of people a small number will have specialized knowledge of the question, and a larger number will have general knowledge. If the wildly ignorant are simply guessing then their errors will frequently (but not always) cancel each other out and what you are left with pushing the data are the experts. You aren't averaging a bunch of guesses, you are asking enough people to find someone who knows the answer and then averaging out all the bad guesses.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

If the general-knowledge errors cancel out when you average them, then you don’t need the experts. But if the general-knowledge errors don’t cancel out, then a small proportion of experts won’t help much.

Expand full comment

Your envisioning a situation where the answer is 55, and one idiot guesses 100 and another guesses 10. When people with knowledge are asked a question their answers should converge close together, if you mix in a whole bunch of noise that signal can still shine through. If you have 100 people and 1 person knows the answer perfectly and 9 people know enough to get close and 90 people are guessing then it isn't that the average of the 90 guesses is going to nail the number, its that their guesses just have to neutralize each other in a way that the cluster of the 10 who know something acts as a strong pull toward the correct answer.

Expand full comment

But guesses don’t “neutralise each other”. If the average guess of the people guessing is 50 in your example, then the average guess overall will be 50.5.

Expand full comment

Perhaps quite tangential, but this has me thinking about how criticism and ratings can help us use “the wisdom of crowds” to predict things that aren’t objective in any real sense.

For example a movie’s quality is pretty arguably entirely subjective, and whether any one person will like a given film is hard to predict, but we all commonly use the wisdom of crowds to estimate a film’s quality and help us predict if it will be with our time or not.

Each person who rates a film is “guessing” the film’s objective quality, since no one person actually gets to claim that objective perspective. But if we add up enough subjective guesses, we can kind of approximate some kind of “objective” value.

I think there are probably a lot of ways we use a kind of vague sense of what “the wisdom of the crowd” Is about certain issues to help us make judgment calls.

Expand full comment

That’s an interesting idea, but I’m struggling to come up with a good use case. Most of the “subjective” things I can think of are multidimensional, and a straight average wouldn’t help much. You need to pick how you’re trading off between a film most people think is pretty good, and a film a smaller number of people think is great. However you pick, your answer is objective, but it doesn’t capture everything you want to know about whether the film is good. In a way, people already do this wisdom-of-crowds based ranking of films by looking at box office results. That can generally tell you if a film is popular, but by itself it doesn’t tell you if it’s good.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

"but I’m struggling to come up with a good use case."

Restaurant ratings on Google maps?

Edit to elaborate: the taste of food is famously subjective. However, would you prefer to eat at a restaurant where 90% of patrons say it's tasty, or one where only 30% say it tastes good? From a stochastic perspective, without further information, you are more likely to be part of the 90% crowd than the 30%.

Expand full comment

I guess that’s a good point. If you start with no information, then one piece of information is better. This would have quite a low ceiling on its precision, though. I wouldn’t have high confidence that a 4.8 star restaurant was better than a 4.7 star restaurant, even if both of them had tens of thousands of reviews.

Expand full comment

>What about in finance, where people often make numerical estimates (eg what a stock will be worth a year from now)?

Isn't the price of the stock already, in a meaningful way, already the wisdom-of-crowds answer to something like that question?

Expand full comment

> This looks like some specific elegant curve, but which one? A real statistician would be able to give a good answer to this question.

Under the simplest hypotheses, it should be the sum in quadrature (i.e., a ⊕ b = √(a² + b²)) of a s.c. "statistical uncertainty" proportional to 1/√n and a s.c. "systematic uncertainty" which stays constant.

Expand full comment

Regarding the answers for the Paris - Moscow distance. I think it's hilarious how you were surprised at some very wrong answers and assumed the reason is lizardmen/trolls. You're really just underestimating how bad some people are regarding distances and geography. I tried hard to give a good estimate but ended up with what is essentially a random number that could have been the distance to the moon for all I know.

Expand full comment

Note that the error can never go completely to zero for the infinite crowd. There should be a lower bound on persistent error set merely by the resolution of typical maps - plus an additional contribution from people's natural tendency to round large numbers. Sorry if this comes across as too pedantic, but I think generally these limits set by resolution are interesting and often neglected!

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

I think on non-numerical things, we already instinctively use wisdom of the crowds. You feel vaguely positive about academia *because* you’ve heard people say more good stuff than bad stuff about academia. Our brains are very good at subconsciously "averaging" status signals, perceived utils, etc, but not so good at averaging actual numbers, so it’s only once we start putting numbers on things that we have to remember to do the averaging step explicitly.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

(This is Eric; I helped run the 2022 forecasting contest.)

I've thought a lot about this -- indeed, the first paper I wrote in grad school can be summarized as "the wisdom of crowds is a *mathematical* fact" (if you aggregate forecasters in a way that accords with how you score them). I'm planning to write a blog post about this, but let me briefly illustrate what's going on in this comment.

Suppose you put 100 candies in a jar and ask people to estimate how many candies there are. You're then going to score each person based on how far off they were, and compare two quantities: the average of everyone's scores, versus the score of the average of all the estimates (the latter is the wisdom of the crowd).

We're gonna score each participant based on the *square* of the distance to the right answer. (Why the square? Briefly, this choice incentivizes each participant to truthfully report how many candies they expect are in the jar.)

Let's say that the estimates are 90, 100, 110, 120, and 130: so, the participants disagree with each other but are also somewhat biased upward.

From first to last, the (squared) errors of the five participants are 100, 0, 100, 400, and 900, for an average of 300. By contrast, the average of all five estimates is 110, which is only off by 100.

In fact, it is *always* the case that the second number (error of the average) will be smaller than the first (average error), no matter which numbers I chose for my example. An intuition you could have is that the first number is equal to the second number, *plus noise*, where the "noise" is the variance in the participants' estimates. (Check it out: the (population) variance of {90, 100, 110, 120, 130} is 200, which is equal to 300 - 100!)

(Feel free to skip this aside, but: what's the math behind this? Briefly, let X be the random variable equal to the signed error of a randomly chosen expert -- so in our example, X would take on the values -10, 0, 10, 20, and 30 with equal probability. Then the average error is E[X^2], whereas the error of the average estimate is E[X]^2. The former quantity is larger, and the difference is E[X^2] - E[X]^2, which is the variance of X.)

The math here is sensitive to the fact that I chose squared error (and to the fact that I chose to aggregate estimates by averaging them). If -- as Scott did -- you take the *absolute value* of error instead of the squared error, it's no longer *mathematically* true. However, I would bet that it's empirically true a large fraction of the time. That's because if some participants underestimate the quantity and others overestimate it, they both count positively toward the average error, but the *cancel each other out* when you look at the average.

As for whether your error will go to zero as the crowd size goes to infinity: no. This is only true under a really strong assumption, which is that the crowd is *unbiased*. So for example, if in my example you have a huge crowd but they're systematically biased so their estimates are centered at 110 instead of 100, then in the limit of an infinite crowd you're still going to be off (your average will be 110).

And -- last point -- regarding making multiple estimates on your own and averaging them: it's definitely an interesting frame, but I'd say that you've reinvented the art of *thinking longer about the problem* :)

Here's what I mean: suppose you're weighing going into grad school versus getting a tech job. You think for a while, and you realize: "I'll be 9/10 happy with my pay at the tech job, but only 5/10 happy with my grad school pay." Then you think longer and realize: "I'll be 8/10 happy with the sorts of problems I'll be thinking about in grad school, but only 6/10 happy with the sorts of problems I'll be thinking about in tech." Then you think longer and realize: "the weather at the grad school I'm considering is 3/10, while the weather in the Bay Area tech job is 9/10". And so on. If you wanted to, you could think of each of these things (pay; intellectual interestingness; weather) as separate estimates. And then you can be like "wow, my decision will be more accurate if I average all my estimates together than if I make my decision based on a single factor!" -- I think that's basically all that's going on with the "wisdom of the crowds" here.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

Stupid question: hasn't this topic been done to death in statistics? I'm not an expert, but from what I remember, yes, you can combine lots of inaccurate predictors into a more accurate predictor - provided the individual predictors are unbiased, i.e., they don't systematically over- or underestimate.

My gut feeling is that this is the hard part - finding a dozen people knowledgeable enough to give a meaningful estimate is doable. Finding a dozen people who are not all influenced by the same sources of information to be overly optimistic or pessimistic is the hard part, and if you don't, you converge with great confidence on an inaccurate answer.

Edit: should have read Unexpected Values' answer above before I wrote this...

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

There is a UK quiz show called "Who Wants to be a Millionaire" in which an individual is selected from a dozen or so competitors by their correctnesss and speed in answering a preliminary question, such as "Put such and such into alphabetical order" and is then asked a series of questions by the host. Each question has four possible answers, shown to the contestant, and one of these is the correct answer.

The contestant starts with three so-called "lifelines", which they can use once each for any question whose answer they are unsure of or don't know: "50 50" (which halves the number of alternative answers), "Phone a friend", and "Ask the Audience".

The "Ask the Audience" lifeline is the most relevant to this discussion. When it is invoked, each audience member selects on a key pad the answer they know or guess is correct, and the contestant is then shown a bar chart of the percentage of selections of each remaining possible answer.

For a commonly known answer, to a question relating to sport or soap operas for example, the "Ask the audience" lifeline is usually fairly conclusive, and one of their choices obviously predominates and turns out to be the correct answer. But sometimes a majority, occasionally spectacularly so, chooses the wrong answer!

It is interesting to speculate why so many people would choose the same wrong answer, presumably guessed. From my observation of several examples, the main reason for this is that they are biased toward a name they have heard of, or association familiar to them, among others they have not.

I have also observed that another source of audience bias obviously occurs when a contestant is unsure of the answer and the host asks them, before they commit to a choice, which answer they think is correct. It seems very foolish for a contestant to divulge a guess in that situation and then go on to use the "Ask the Audience" lifeline, as they will have influenced equally unsure audience members in advance, but many do!

Expand full comment

There's a US version of the show, quickly adapted from the British version in the late 1990s. It was a genuine prime time phenomenon for a few years a couple decades ago, and continued to run with a lower profile till pretty recently.

There were also versions for a while in the Disney parks where guests played for much lower stakes-- top prize was a cruise. I managed to get on one of those, and while I didn't quite make it that far, I did well enough that I had a day or two of micro-fame as people would come up to me in the parks to say they'd seen me play. Partly because I'd gotten some questions that were obscure to a general audience but had to use a lifeline for a question about a wildly popular sitcom.

(Since of course you wouldn't have prearranged a friend, the equivalent was "Ask a stranger a Disney employee flags down outside the venue." When I tried to use it for that question, I suspected from the accent of my new "friend" that they weren't any more familiar with popular American sitcoms than I was, and so it proved. Fortunately I guessed right anyway. :-) )

Expand full comment

I just asked my partner: First answer was 2000 km. Second answer was 1500 km.

In that case the error got bigger. Could there be a failure mode for this technique, that while on average it may make you more correct, there are doom cases where it makes you catastrophically more wrong?

Expand full comment

>Since we only have one datapoint for the n = 6924 crowd size, it’s not significant and we should throw it out.

I have no background in statistics, but that seems wrong to me. Is that the only rationale for throwing it out? On average, it should be at least a good a crowd like any other, and according to the theory (larger crowd = better), it should be the most representative of the average participant's wisdom.

Also, how did you come up with the crowd size of 100 for doing the analysis? If you tried different crowd sizes, were the results different on average? Did you try a sample of random crowd sizes?

Expand full comment

Also, if this crowd of 6924 did worse than an average crowd of 6924 (in whatever sense—across the population?), then the average a 100-person sub-crowd of it should also do worse than the average 100-person crowd in general, shouldn't it?

Expand full comment

My objection is, when you are already doing several levels of averaging (first you average participants in a crowd, and then you average sub-samples of that crowd), what is the reason for throwing out one specific sample? If that one sample is also the one that should be the most representative, according to the theory, it just smells of ignoring it *because* it doesn't fit the theory.

Expand full comment

This is similar to something I sometimes have to do for my job. We need to get estimates from experts on quantities of interest. There are various techniques you can employ to get them to give unbiased answers. The simplest and most useful one is after you ask for their best guess you ask "is it more likely that the true answer is above or below your guess?"

It's a technique I employ in my own decision making too, and from the comments it seems that lots of other people do.

Expand full comment

> As mentioned above, the average respondent was off by 918 km on their first guess. They were off by 967 km on their second guess.

Was the second guess (on average) higher than the first? Estimating a distance has this asymmetry where there are a finite number of ways to undershoot, but no limit to how far you can overshoot.

If so, maybe the you-are-the-crowd hypothesis has a better shot at holding true in something like betting the point differential in a game?

Expand full comment

You should check out 'Noise' by Kahneman, Sibony and Sunstein. It's a whole book about this stuff. They discuss lots of experiments on the wisdom crowds including a crowd of 1. Especially interesting is when they address real world applications - in sentencing, insurance, executive search and more.

Expand full comment

Isn't the wisdom of the crowd sort of the whole idea of democracy?

Assuming everyone makes some kind of internal estimate of how good/bad each candidate's policies are, a fair election should spit out the best option according to the median estimate. It's a lossy compression - we lose the numbers themselves and skip straight to the decision, and I'm not sure how well crowd wisdom works with the median, but I think our systems do *try* to apply this principle more than we give them credit for.

Expand full comment

I think the *idea* of democracy is more qualitative than that.

As for the empirical efficacy of actual *elections*, there's some bad news (especially for first-past-the-post (FPTP), the most familiar form of an individual election, at least for Americans): https://en.wikipedia.org/wiki/Arrow's_impossibility_theorem

Expand full comment

Love the math! I was a stat addict in college so this post struck the right chord.

Expand full comment

Writing a diary is the traditional way to access the wisdom of the crowd of your past selves.

(Though I admit I have never seen a diary that would end every entry with: "My today's estimate of the distance from Paris to Moscow is 1234 km.")

Expand full comment

I think Napoleon’s might have.

Expand full comment

ROFL

(sorry for low-content comment, but that's what you get when you remove likes)

Expand full comment

I'm a bit late to this, and I haven't read all the comments, so it's possible someone else mentioned this, but it seems like "wisdom of crowds" becomes less useful for highly subjective future predictions that do not involve estimations of objective concrete facts which are not influenced by future decisions made after making estimates. If wisdom-of-crowds predictions are made about things over which our decisions have influence, the merely knowledge of the wisdom of crowds "answer" for a question influences our future decisions about things, rendering the prediction unreliable, because the learning of the prediction changes the likelihood of the outcome (making it either more or less likely).

Expand full comment

I don't know if anybody else did this, but when I guessed the second time, I imagined how I would guess if I knew I was off significantly with my first guess. So it wasn't a clean guess. I guessed 5 or 10 times my first guess because I was imagining how I'd react if someone told me my first guess was way off... If that makes sense.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

1. Use and share a histogram.

2. Mean and standard deviation

3. Geometric mean vs. Arithmetic mean - why are you fooling around with this. Is this purposeful obscurantism?

4. parrhesia - start over and rewrite this essay.

Expand full comment

I'm am not sure you even precisely are stating the wisdom of the crowds theory, when you write: "Ask one person to guess how much a cow weighs, and they’ll be off by some amount. Ask a hundred people and take the average of their answers, and you’ll be off by less."

This isn't really correct. Galton's first observation was about the median not the mean. And Kasparov beats the World. And the crowd guess will not be better than any individual's guess. It will be better than the "majority" of the guesses. In addition, WCT presumes a diverse crowd.

I feel like this whole essay was half-backed. And you didn't really think carefully about the question from the start.

You also sort of wanted to write about bootstraping. But there was a lot of hand waiving about the theoretical and practical issues surrounding that process.

Lastly, because this whole article bothered me a lot: self correction.

Have you ever explored the Funnel Experiment found in Deming's The New Economics?

Expand full comment

> 3. Geometric mean vs. Arithmetic mean - why are you fooling around with this. Is this purposeful obscurantism?

No, it's just necessary because the number of kilometers from Paris to Moscow is an unbounded number and therefore if you don't use geometric mean, one guy guessing seven million light years effectively destroys the data. (He still does with geomean, but less so, so the more reasonable numbers win out.)

Expand full comment

1. Median! is what a Galton first noticed at the county fair.

2. Why should we even include 7 million light years guess?

3. The question presented was really why doesn't WTC seem to work here? The answer is not let's figure out a way to make it work!

Expand full comment

2: If it were literally 7 million light-years, sure, exclude it, but even with a cut-off applied (he rejected the data greate than the circumference of the Earth, a piece of extra real-world data that wouldn't be available in the case of many other questions) he still got much too high an arithmetic mean. Using a smaller cutoff would be even more cheating; of course you can get a good answer by rejecting all the guesses that are too far away from the true answer. The question is how to do this from the data of the guesses themselves, which is essentially what using a different sort of average (like the geometric mean or, as you say, the median) does.

Apart from that, there's a good reason to apply the geometric mean specifically in this case rather than other averages that might de-emphasise higher values. This kind of estimate can be roughly modeled as having multiplicative errors rather than additive ones. Maybe you take the circumference of the Earth and the angle between Paris and Moscow and multiply. Maybe you guess about how long it would take to drive and use speed and an estimate of road windiness. Maybe you've seen it on a map next to two cities where you know the right answer and you try to scale it. Maybe you misplace a decimal point and go off by a factor of ten. For multiplicative errors, a multiplicative average makes sense to get rid of them. In the ideal case, this process leads to a log-normal distribution, in which the median and geometric mean coincide at the true underlying value.

Expand full comment

None of this has to do with the actual questions that should have been raised by the essay.

How do we operationally define "the wisdom of the crowds"?

What are the assumptions, caveats and theoretical underpinnings of WTC?

Scott thought, apparently, that WTC had to do with average when it seems to usually be defined in terms of median.

Coming up with an example that didn't comport with his apparent definition (averages) he sought not to explain why it didn't work or to evaluate whether WTC was actually not a valid idea; but instead chose to fool around with math to come up with a way to "make it sort of work."

Putting code into the essay (totally unnecessary) is a kind flag for obfuscation and amateur hour.

Expand full comment

My first guess unburdened by the thought process was1500 miles which turns out to be within 72k of the right answer I surprised myself. My second guess my reasoning was, well if I'm off by a non-trivial amount....

So, maybe the second question should be just "guess again" which is closer to how a crowd works

Expand full comment

You say the right answer is 2486, but then use 2487 in all the calculations.

Expand full comment

I've taken a quick gander at the survey results, and I think you might have ballsed this one up.

There's a problem with the first question, in that there are two possible answers; by road (2834), or by air (2834). That's a difference of ~350km or about 12% before you start.

As you used "non-trivial amount" in the second question, there's a spot of priming/framing going on, such that the second answer can be reasonably expected to be further away from the first than would otherwise be the case.

Anyway, off for a spot of fun slicing and dicing.

Expand full comment

The question explicitly asks for straight-line, "as the crow flies" distance

Expand full comment

Isn’t wisdom of the crowds in everyday issues a big part of asking a friend for advice? This seems pretty self-evident to me.

Expand full comment

No. Asking a friend is a sample of 1.

But implicit in WTC is sampling theory.

There are also many other caveats (Kasparov beats the world) and assumptions (presence of diversity of guessers) which I am sure you can look up which are not present in asking a friend for advice.

Expand full comment

Except that asking yourself again (about the distance to Moscow) is even more low-sample than asking a friend: even less diversity and even more confirmation bias. But it seems (at least according to our host) to have at least some value.

Expand full comment

But asking yourself again is not the wisdom of the crowds, as you correctly point out. No increase in diversity.

So EVEN if it has some value, it is for a different reason than the underpinnings of WTC.

The entire essay is flawed.

Expand full comment

Asking a friend is a sample of 2: you and the friend. Asking multiple friends, which you might do for important decisions, is a sample of many.

Expand full comment

Ok. But not really a sampling size of sufficient power.

This might be a start:

https://link.springer.com/article/10.1007/s41064-022-00202-2

Expand full comment

According to Scott, the Van Dolder and Van Den Assem article found "You can approximately halve outer crowd error (in this task) by going from one to two people (this wasn’t true in my Moscow task!).". That's a huge improvement!

Expand full comment

I'm a little mystified that you assert nobody thinks about the wisdom of the crowds, either inner or outer, in their ordinary lives.

In my world, asking people who you know (and sometimes even that you don't know, like in a blog comment section) for their thoughts before you make an important decision -- consulting the "outer" crowd -- is ubiquitous. I can't think of anyone who *doesn't* do this. SImilarly, "sleeping on it" or "not making decisions hastily," which amounts to asking the inner crowd (i.e. "re-evaluate this estimate again after some time has passed") is also ubiquitous.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

For that matter, respect for the wisdom of the crowds is often a reason conservatives are conservative ("If the 100 billion human beings who lived before now thought it was a good idea to have one pronoun for people born male, and another for those born female, maybe one should respond cautiously to the 500,000 who want to make a big change today") and one reason why policy-makers making weighty policy would be conservative ("If every Secretary of Defense from 1790 on down made decisions by talking to generals, Senators, and professors of military history, maybe I should be cautious about throwing that all out and watching a prediction market.")

Which is to say, the people who baffle you by not embracing newfangled methods of prediction may actually already understand the wisdom of the crowds pretty well, and be using it.

Expand full comment

We use wisdom of crowds all the time, and have for thousands of years, without a numerical component. A king's advisors are literally a core example, but extend to something like the Cabinet in the US, or a board of directors at a large company. If I'm thinking of making an important life decision, I may check in with my spouse, my sister, my best friend, my pastor, my financial advisor, and whoever else. All are using a type of wisdom-of-crowds.

Do you not see those as the same for some reason?

Expand full comment

I already do this when I make estimates, and I think many other people (less than 1 in ten, but at least 1% or so) do too!

Specifically, when I am making a you-only-get-one-guess kind of guess, and it's important that I'm maximally precise (such as when the best guess, of hundreds, wins a prize, but there's no prize for being almost as close), I start by asking what number I'd throw out. I just cough something out via whatever estimation tool pops to mind. Then I try to identify, assuming that's *wrong*, which *way* it's wrong--meaning take a second guess, with a different estimation tool, or a more careful use of the first tool. Then a third. And etc. I'll also put error bars on my guesses' estimation tools (e.g. an estimate arrived at via multiplying four numbers with plus-or-minus 50% has a much bigger error range than an estimate arrived at via adding four numbers with plus-or-minus 50% error bars.)

I think when the stakes are high, people already do this. When the stakes are low, mostly they don't--so the "this is OP" will show up in survey questions, but not in real life (or less so) for stuff that matters.

Expand full comment

What you are doing is wise but it is NOT WTC, it is just trying to be a better predictor.

Expand full comment
Feb 7, 2023·edited Feb 7, 2023

There are 6,379 questionaries where the question was answered both times. 6,076 had both answers between 500 and 20,000 km. I am using the 6,076 as my population for the analysis. I binned the answers into increments of 250 (e.g. Bin 2,500 would contain the count of all guesses from 2,251 to 2,500) in order to remove odd cutoffs since most answers were round numbers, I didn’t want a bin ending in a 0 or a 9 to change my results much.

For the 2 closest bins (2,500 and 2,750 which would account for guesses between 2,251 and 2,750 km) There were the following number of guesses in that range:

Guess 1s: 579

Guess 2s: 626

Averages: 914

Of those that originally guessed in that range, 227 (39%) had Averages in those 2 bins.

When you include the next 2 bins (2,001-3,000 range) the effect gets smaller

Guess 1s: 1,485

Guess 2s: 1,583

Averages: 1,627

When you include the next 2 bins (1,751-3,250 range) the average gets worse

Guess 1s: 2,424

Guess 2s: 2,375

Averages: 2,185

870 had their Guess 1 in the 2,000 bin, of those, 444 had a better average (Between 2,250 and 2,750). 102 had higher averages that were just as bad or worse (>=3,000) and 320 had lower averages that were just as bad or worse. (<=2000)

817 had their Guess 1 in the 3,000 bin, of those, 311 had a better average (Between 2,500 and 2,750).

456 had higher averages that were just as bad or worse (>=3,000) and 50 had lower averages that were just as bad or worse (<=2,250)

The below contains most of the data 5,357 of the observations. The first column is the bin their Guess 1 was in, the second column is the count of guesses, the third column is the amount of averages that were better than the Guess 1, and the fourth column is just column 3 as a percentage.

Bin Count Improvements % Improvements

500 66 62 94%

750 80 48 60%

1000 533 314 59%

1250 182 118 65%

1500 412 277 67%

1750 146 88 60%

2000 870 444 51%

2250 89 39 44%

2500 526 0 0%

2750 53 0 0%

3000 817 311 38%

3250 69 28 41%

3500 276 135 49%

3750 30 13 43%

4000 521 261 50%

4250 25 14 56%

4500 139 59 42%

4750 10 5 50%

5000 513 264 51%

Overall, it looks like those who initially guessed low had their average improve their score while those who initially guessed high did not. I am not sure what to take away from all of this, besides it’s not obvious that an individual guessing is a good way to go about increasing accuracy. There may be an effect where the best original guessers could do even better by multiple attempts.

Expand full comment

Maybe the real world doesn't use proper scoring rules, so we don't instinctively make decisions that are good at maximising brier scores or whatever metric you want to use.

Like if you come to a fork in the road, with one path taking you north and the other east, and you think that the eastern path is likely to be 3 times worse than the northern one, you don't average them and set off NNE.

Expand full comment

Random fun and reasonably relevant math fact: arithmetic mean and geometric mean are both special cases of an elegant generalization of pretty much all possible kinds of means: the power mean. https://en.wikipedia.org/wiki/Generalized_mean

Expand full comment

Not at all what I thought you were testing. I thought you had found some different magic hack, because when you asked the first time I said 3000km and the second time I said 2500km, almost bang on when I checked.

Expand full comment

About your last few paragaphs, I agree with most of your remarks about how the general inability of people to consistently and adequately quantify "vague feelings", and that knowing that, finance people will prefer models they trust to relying on people.

However, something in your paragraph about forecasting and the way it's not used in the real world brought me back to the fact that most decision-making is going to be not an applied probability, but a black-or-white decision. "What is the probability that I should work in academia" actually makes little actionable sense if it's not >80% or <20% (adapt the numbers to your risk-aversedness). And actually, going back to forecasting, it is very counter-intuitive that black-or-white events should be probabilized: either you believe they will happen, or you believe they will not -- it's not like there can be anything in between. What people making decisions want is not a 66% chance: they want a 0-or-1 belief that can be explained.

Expand full comment
Feb 7, 2023·edited Feb 12, 2023

Your method of using the geometric mean of the absolute error doesn't work well as a summary of how far off the typical answer was. Suppose for example the true answer to some question is 20, and the guesses are distributed uniformly randomly within the interval [19,21] (the exact form of the distribution doesn't matter, so long as it's continuous with a non-zero density near the true answer). After taking the absolute error, it's uniformly random in [0,1]. If this average is well behaved, in the limit we can replace the product with an integral exp(integral from 0 to 1 of ln(x) dx). The integral is negative infinity [edit: As Matthieu pointed out this is not correct, and therefore neither are the things that follow.] so the answer (exp of that) is 0. This means the average is not well-behaved, but intuitively it implies that the geometric mean of the absolute error will tend to zero as the number of samples tends to infinity, even if the actual average error remains constant. Note that this is not the error of the average, but the average of the error, which should remain non-zero. This is probably why the size-ten crowds appeared better than the full-size crowd, because this method of averaging over-emphasises values near zero.

A more reasonable way to combine the geometric mean with estimating errors would be to take the logarithms of all the estimates and the true value, calculate the mean-squared error or mean absolute error or something of the logarithms, then either use this result as-is ("estimates were off by X orders of magnitude on average") or take the exponential of it ("estimates were off by a factor of [e^X] on average"). In either case the result is dimensionless rather than being a kilometer value.

I may at some point re-do your analysis with this method and see how much it changes the results.

Expand full comment

You can't just exponentiate to get back to levels from logs because of Jensen’s Inequality. You can convince yourself of that by taking a mean of some strictly positive numbers and comparing that to the exponentiated mean of the logged measurements. It will be too small.

Expand full comment

Tangentially related:

In the interest of exploring the use of a geometric mean but without the undesirable property that any single datum of (0) will nuke the mean to (0)... I was wondering whether it was ever reasonable to raise each datum to (e^x), run the geometric mean, and then take the log of that. in other words, if

f(x) = e^x

g(x) = product(x_i)^(1/n)

h(x) = ln(x)

then

h(g(f(x))) = ln(product(e^x_i)^(1/n))

h(g(f(x))) = 1/n ln(e^sum(x_i))

h(g(f(x))) = 1/n sum(x_i)

Doh, it turns out I reinvented the arithmetic mean. Must be a sign that I oughta crack a textbook. And I should have known better anyway, because I already knew in the back of my mind that addition and multiplication are related by exponentiation. In any case, today I learned why the arithmetic mean is a reasonable default.

Expand full comment

I don't understand what you mean, in particular by "levels". Do you mean that geometricMean_i(x_i) is not e^arithmeticMean_i(ln(x_i)) somehow? Jensen's inequality implies that the geometric mean is less than or equal to the arithmetic mean, but that's the point in using the geometric mean.

Expand full comment

Could you get valuable results for a single person by asking about the distances between two different pairs of cities which are (roughly) the same distance away from each other? You would get some confounding results based on personal knowledge of geography, but it might be useful way of averaging multiple rough guesses from a single person.

If you're asking about a single data point, I wouldn't expect multiple guesses from a single person to be particularly useful. Asking for an error margin (or a min/max value) is the only way I can think of to actually get additional useful information from a single person.

Expand full comment

Opinions are not formed independently, there are latent (hidden) relationships that imply correlation structure among the responses (gender, politics, shared exposure to blogs, location, SES, ...) effectively reducing the size of the crowd.

Expand full comment

Scott, surprised you didn't mention the TLP article on the same topic.

https://thelastpsychiatrist.com/2009/01/gods_cheat_code_for_accuracy.html

Expand full comment
Feb 8, 2023·edited Feb 8, 2023

I don't think your second question's formulation was useful for testing your hypothesis, but it is an interesting illustration of the wisdom of the crowd effect. If everybody gives their best guess first, and is told to think it is very wrong, and made to give a second "guessier" guest you don't have a wisdom of a crowd of size = 2x, you have the wisdom of a crowd of size X, and the intentional worse effort of a crowd also of size x, and are taking the average.

Here, everybody gave their best guess and were 918 km off. They were told to assume their first guess was non-trivially wrong, and take a second guess. Collectively they were able to reduce their individual "wisdom" by making a guess that was not their actual first choice, and as a crowd were successfully not as correct as their actual collective first choice.

I'm curious if this would be replicable. "If you ask a crowd a knowledge question and ask them to guess, then tell them to assume their first question was nontrivially incorrect and guess again, the average of the first guesses will be closer than the average of the second guesses to the correct answer". In other words, respondents are successfully and correctly assigning a relative probability rating to their top two guesses

Expand full comment

https://www.youtube.com/watch?v=CNvz91Jyzbg&t=3043s

Generally interesting interview with Edward Thorp-- this is the section where he explains how he found out that Madoff was a fraud, and the importance (quite specifically) of actually understanding what's going on rather that trusting the wisdom of crowds. The crowd trusted Madoff.

I *think* the wisdom of crowds is about estimates, while it's possible that there are things which can actually be understood. How can you tell if you're dealing with something which can be understood?

Expand full comment

https://www.youtube.com/watch?v=CNvz91Jyzbg&ab_channel=TimFerriss

Not the first place I've heard the idea, but Edward Thorp argues that that holding index funds are the best investment, and part of that is because getting in and out of stocks (and other investments?) is heavily taxed.

Is substantially taxing getting in and out of particular stocks a good idea? Are there costs to pushing people into index funds?

Generally interesting interview.

Expand full comment

Though for a lot of people, the bulk of their holdings are in IRAs and 401(k)s that makes those taxes less of a factor.

As far as I know passive investing (in index and target date funds) is still generally considered the most prudent option. But if there's an investment style that's being cramped by capital gains taxes, that's one place it would be possible to try it. (For someone less risk averse than me, anyway.)

Expand full comment

Isn't democracy wisdom of the crowd at scale? And explicitly so if we look att the jury theorem, which was an argument for democracy from Mr. Condorcet.

Expand full comment

"If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?"

Yes! We use wisdom of crowds all the time! When you ask your friends for advice, that's wisdom of crowds. You don't ask them for numerical estimates of happy you'd be in academia vs. industry because our brains didn't evolve to deal with numbers. Instead, you ask them whether they think you should go into academia and industry. They subconsciously weigh all the information they have, and give you a subjective sentiment. If Alice, Bob, and Charlie all say you'll be miserable in academia because of X, Y, and Z, except Alice and Bob say you'll be slightly miserable and Charlie says you'll be very miserable, that's pretty good evidence against going into academia.

Democracy is also wisdom of crowds. You don't ask people what the chances of Russia invading Ukraine are. Instead, you do opinion polls to ask them what the proper foreign policy toward Russia should be, which is what you want to know anyways. The war hawks that would get us into a nuclear war cancel out the hippies who would have Russia roll all over us, and in the median you get a more or less reasonable foreign policy.

Expand full comment

I was one of the people who gave a ridiculously far-off estimate. Partially this was due to me not remembering how far a kilometer is and going "well, it's either roughly half a mile or roughly two miles ... roughly two miles it is"

Expand full comment

There's an easier analysis of the guessing twice strategy:

For each individual, which guess or average is closest to the correct answer?

Using the public data of everyone who answered both,

First guess is best 3036 times.

Second guess is best 2569 times.

The average of the two guesses is best 643 times.

The geometric of the two is best 358 times.

So in this data the first guess is most often the closest one. However, if you combine the second guess and the two averages, one of them (but you don't know which) is best more often than the first guess.

That doesn't feel surprising. It's another way of saying "Consider other values, and move your guess closer to values you consider probable." Which is something you're probably doing when you're guessing anyways. There might be some value in writing down a number of guesses, rating them, and then iterating with more guesses and ratings until you feel like you can't improve any more. As a way to simply force yourself to have more thinking time on the problem considering different possibilities.

Expand full comment

You have used the abbreviation “OP” twice in this post without mentioning it what it means.

I have noticed that ACX and LW posters also tend to use abbreviations a lot and just assume that their audience is in the know. This is very frustrating and does a disservice to those who are new to your community and are trying to learn and enjoy being a part of these discussions.

Does any know what “OP” refers to?

Thanks

Expand full comment
Comment deleted
Expand full comment

Touche!

Expand full comment

overpowered

Expand full comment

Thanks!

Expand full comment

In the case of prediction market contests, there's an additional factor that would cause averages to be more accurate, even beyond the usual wisdom of crowds. The goal of a participant isn't to minimize their *expected* error, the goal is to *maximize the chances of rising to the top*. Realistically, the only way to win a contest like this is to gamble on some unlikely outcomes and hope for the best. But if you average a crowd together, they'll probably gamble on different things and the outliers disappear.

Expand full comment

I'm pretty sure at least one of the very low outliers was my honest answer. I guessed that the round-number km distance from equator to pole was probably 1000 rather than 1,000,000, and Paris-Moscow was probably about a fifth of that.

I think it was wrong to remove outliers.

Expand full comment

This phenomenon is similar to test time augmentation in ML. There you do multiple predictions with a bit of randomness injected into the same inputs. Then you take those multiple predictions sourced from the same model and average them.

If you instead enable dropout, which is kind of like a human being on lsd. It's called monte carlo dropout'

Expand full comment

Spooky. The Van Dolder paper is on my very short list of studies to read soon.

The Good Judgement Project had an experimental condition where some participants made predictions by simultaneously betting in a prediction market and sharing the expected likelihood of an event directly. Their aggregation algorithm achieved highest accuracy when both the bet and the direct prediction were incorporated, implying that they each held some different information. They speculated this was due to the "inner wisdom of crowds" thing, and I think cited that same study.

Expand full comment

Ask the entire population of the world whether God exists. The wisdom of the crowds would point to yes.

Expand full comment

Why would you have to convert life decisions into numerical values? Consulting family, friends, etc is pretty basic human behaviour, and they certainly may be of clearer mind about your preferences, character, capacities etc than you, at least in certain aspects, and many instances.

Expand full comment

I can offer an anecdote about that.

Just yesterday, at a mall, I spotted one of these "guess how many marbles are inside and post the answer online" contests. As it happens, I read this post just recently, so I figured I might stand a good chance if I tried this technique: make a few logical guesses, each time assuming my previous guess was wrong in some way, then average them all.

My guesses? 6000, 9000, 4851, 5198. So my answer was their root mean square, 6472. (For some reason I felt the r.m.s. was better than a mere average.)

The actual answer? 6498.

99.6% accuracy. My mind was blown.

(In case you're wondering what I won, the answer is: nothing. Turns out the contest ended 18 months ago, and then they just left the big glass case with marbles there, complete with out-of-date instructions.)

Expand full comment

Since multiple estimates from one person seem to be more powerful the less correlated they are, I wonder if there are any strategies-of-thumb which might reduce the correlation among an individual's estimates (other than the obvious wait-until-they-forget-their-other-answers).

e.g.

Estimate 1: "gut feeling",

Estimate 2: "Fermi estimate",

Estimate 3: "some other semi-structured way of extrapolating unknown quantities from known quantities",

etc

Expand full comment

This is a well studied property of statistics. If you are trying to estimate the mean, odds are just taking the mean of your sample will be incorrect. If you throw in a completely random number, the odds are good that you will get a better estimate. It's sort of like the Monty Hall three doors paradox.

This may seem totally strange, but statisticians use this to improve their estimates of the mean by a process called bootstrapping. (There are variants of this. e.g. jackknifing.) The idea is to take the means of different subsamples of the original sample and using them to derive a better estimate of the mean.

It's not so much a property of crowds as a property of numbers.

Expand full comment

> 1/ERROR = 2.34 + [1.8 * ln(CROWD_SIZE)]

I think something is wrong with this equation. 1/ERROR should always be < 0.005 because the error is always ~200km or more, ln(CROWD_SIZE) should be >1 for any crowd sizes of at least e (2.718), so eg a crowd size of e (2.718) gives predicted inverse error 2.34 + (1.8 * 1) = 4.14 or an error of 1/4.14 = 0.24 km, which is way off. Unless I'm missing something, which is entirely possible.

I tried replicating this and the best-fit curve I found was 1/ERROR = 0.00093 + [0.00073 * ln(CROWD_SIZE)]. (Side note: I only had n=6537, not 6924 as you said, after eliminating all blank answers.)

Expand full comment