324 Comments

> And then generalize further to the entire world population over all of human history, and it stops holding again, because most people are cavemen who eat grubs and use shells for money, and having more shells doesn’t make it any easier to find grubs.

This is inaccurate. The numbers are pretty fuzzy but I find reputable-looking estimates (e.g. https://www.ined.fr/en/everything_about_population/demographic-facts-sheets/faq/how-many-people-since-the-first-humans/) that roughly 50% of humans who ever lived were born after 1 AD.

Expand full comment

Also, I am often satisfied with psychological studies that may only apply to modern Americans. Or even modern upper-middle class Americans (if so labelled). If the findings don't hold in ancient Babylon I am okay with that.

Expand full comment

Exactly. There’s nothing wrong with studies that only apply to developed Western democracies if the people that will read and use the info are from developed Western democracies.

Expand full comment

Generalizability says a lot about mutability. Show me a population that thinks differently from mine and that is at least weak evidence my population can be changed.

Expand full comment

Having more shells might not make it easier to *find* grubs, but it does make it easier to *trade* your shells for grubs.

Expand full comment

Yeah, that part didn't make sense either. If having more shells doesn't allow you to obtain more grubs and services, then I don't think your society really uses shells as money.

Expand full comment

No. In that specific case it doesn't work. It would work for nuts, but grubs won't keep if you don't have some method of preserving them. You eat them as you dig them up.

OTOH, there's also a lot of question as to what "money" would mean to a caveman. It's quite plausible that saying they used shells for money is just wrong. But maybe not. Amber was certainly traded a very long time ago, and perhaps shells were also, but saying it was money is probably importing a bunch of ideas that they didn't have or want. That trading happened is incontestable, but that's not proof of money.

I *have* run into reports of shells being used for money, but I've also read of giant stone wheels being used for money. I really think it's better to think of things like that as either trade goods or status markers. Those are features that money has, but it also has a lot of features that they don't, like fungibility.

Expand full comment
author
Dec 28, 2022 (edited)

This is irrelevant and I shouldn't be arguing about it, but since I am, James Scott says that until 1600 AD the majority of humans lived outside state societies. I don't know if they were exactly grubs-and-shells level primitive, but I think it's plausible that the majority of humans who ever lived throughout time have been tribalists of some sort.

Expand full comment

The gulf between shells-and-grubs-level primitive and 1600 AD pre-state "primitive" is not irrelevant.

Wealth doesn't become irrelevant no matter how far back you go, either; it's just accounted for differently.

Expand full comment

Granted! But the sentence as written absolutely implies a less nuanced comparison, e.g. BC population > AD population; one seldom thinks of more-modern stateless societies as mere shell-grubbers. ("Don't they, like, hunt and gather or something?") Seems like it'd pass a cost-benefit analysis to reword for clarity, or just remove entirely, since it is in fact irrelevant to the rest of the post. Noticing-of-confusion ought to be reserved for load-bearing parts, if possible.

Expand full comment

I agree. But it shows a pattern of Scott's repeated dismissal of valid criticism of contrived selection criteria, i.e. doubling down with some weird and wrong cutoff for when we moved away from shells and grubs, or something akin to that.

It's similar to dismissing outright the criticism that his selection criteria for mental health might be invalid.

Expand full comment

I frankly didn't understand the hubbub about the previous post. Felt like one continuing pile-on of reading-things-extremely-literally leading to bad-faith assumptions and splitting-of-hairs, for what seemed like a pretty lightly advanced claim. It's riffing half-seriously on some random Twitter noise, the epistemic expectations should be set accordingly low. Like some weird parody playing out of what Outgroup thinks Rationalists are really like. Strange to watch.

Which doesn't give Scott free cover to be overly defensive here (even about minutiae like this - it's clearly not irrelevant if there's like 5 separate threads expressing same confusion, and you can't do the "I shouldn't argue about this, but I will" thing with a straight face), which is also Definitely A Thing. But I'm sympathetic to feeling attacked on all fronts and feeling the need to assert some control/last-wording over a snowballing situation. Sometimes the Principle of Charity means giving an opportunity to save face at the cost of some Bayes points, especially when it seems obvious both the writer and the commentariat are in a combative mood to begin with. Taking a graceful L requires largesse from both sides...

Expand full comment

Absolutely. The follow up thread from the original poster does a good job clearing up what he "really meant" by the tweet. There's some grace towards Scott, though probably more bitterness.

And yes, I would have been more okay with Scott's first article if he had taken a lighter approach, or even done the same work and ended with a more open conclusion. I.e., there are a lot of people that report happiness and well-being without also reporting spiritual experience; then go on to acknowledge, yes, this is a half-serious tweet, intentionally provocative, but what else might they be pointing at?

Democratizing spiritual, mystical, or at least profound and sacred experience is vitally important. Attacking the claim with an unbalanced, unwavering, "fact-check" comes across as antithetical to that purpose.

Expand full comment

I think that the "obesity vs. income" concept becomes overextended once you go back to primordial times. "Income" implies an at least partly monetized economy; that is not how cave people traded stuff, so the question becomes meaningless for them.

Expand full comment

Wealth still applies. But I get what you're saying. Still, orders of magnitude more people existed after the widespread adoption of agriculture. I don't appreciate Scott's dismissal of this by contriving an arbitrary cutoff of 1600 AD as the time when most people were now ruled by a state.

Expand full comment

Would this be something like how you expected things to work out if you had a conflict and needed help with it? Like, do you go to the police/legal system, or to your local lord, or to the local big man in the village, or to your brothers and cousins and extended family, or what?

Expand full comment
Dec 29, 2022 (edited)

> James Scott says that until 1600 AD the majority of humans lived outside state societies

For all that I enjoy James Scott's work, I'm very skeptical of that claim. Estimates from multiple sources here https://en.wikipedia.org/wiki/Estimates_of_historical_world_population put world population around year 1 as between 150 and 300 million, the average being 230 million. Population estimates for the Roman Empire and Han China give 50-60 million each*; Persia probably adds a couple tens of millions, and I'd expect the kingdoms in the Ganges plain to be not very far behind China. I would be very surprised if non-state people had amounted to more than 25% of the world total. Perhaps JS was counting some people who lived physically inside state boundaries as not really being *part* of state societies?

The majority of all humans *ever* is more plausible, though (most would still be farmers of some kind, but not necessarily ruled by a state).

* Eerie how evenly matched Rome and China were around year 1. Kind of a pity they never got to interact directly.

Expand full comment

Estimating the population of non-state societies is obviously difficult, but, yes, my guess would be that this is overlooking the much higher population density of state societies. Anyway, it's irrelevant because most people certainly lived in agricultural societies, which are able to accumulate wealth and trade it for other resources.

Expand full comment

Yeah, it's pretty counterintuitive.

Also, there have only been about 115B humans, ever, so far, and roughly 8B of them are alive right now. Which means that the death rate is currently just ~93%. Before I saw that number, intuitively, I'd have thought it was something akin to >99.9% - virtually everyone died.

This makes not solving aging ASAP rather catastrophic. Sure, there were countless generations before who all died - but at least the population was relatively small back then...
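
A minimal sanity check of that arithmetic, assuming roughly 115 billion humans ever born and about 8 billion alive today (both rough estimates, not exact counts):

```python
# Rough check of the "only ~93% of humans have died so far" figure.
# Both inputs are rough estimates.
ever_born = 115e9   # ~115 billion humans ever born
alive_now = 8e9     # ~8 billion alive today

death_rate = (ever_born - alive_now) / ever_born
print(f"Share of all humans ever born who have died: {death_rate:.1%}")
# -> about 93%
```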

Expand full comment

"Which means that death rate is just 93%, currently."

I have been known to point out that contrary to the common belief that death is certain, a more scientific approach notes that if we randomly select from all humans, only 93% of them have died. With a large enough (and random enough) sample the obvious conclusion is that any human has only a 93% chance of dying rather than a 100% chance of dying.

Most people stubbornly cling to the common belief, however, even when shown the math. Sad.

Expand full comment

Empirical odds of any given human surviving 120 consecutive years still seem very slim.

Expand full comment

You're counting everyone currently alive as a sample case of a human that won't ever die.

Expand full comment

Well, by definition alive people haven't died ...

Expand full comment

Worth noting that, measuring this naively, most people who have ever lived died before age 3, so you probably want to pick a more precise measure.

Expand full comment

I assumed Scott was just exaggerating to make a point

Expand full comment

I'm pretty confused by this kind of attitude. To be quite frank I think it's in-group protectionism.

I'll start off by saying I think most psych studies are absolute garbage and aella's is no worse. But that doesn't mean aella's are _good_.

In particular, aella's studies are often related to extremely sensitive topics like sex, gender, wealth, etc. She's a self-proclaimed "slut" who posts nudes on the internet. Of course the people who answer these kinds of polls _when aella posts them_ are heavily biased relative to the population!

I think drawing conclusions about sex, gender, and other things from aella's polls is at least as fraught as drawing those conclusions from college freshmen. If you did a poll on marriage and divorce rates among college-educated people you would get wildly different results than at the population level. I don't see how this is any different from aella's polls.

Expand full comment
Dec 27, 2022 (edited)

>If you did a poll on marriage and divorce rates among college-educated people you would get wildly different results than at the population level. I don't see how this is any different from aella's polls.

If you did the former in real life on a college campus you could publish your results in a journal, potentially after playing around a bit to find some subset of your data that meets a P value test. If you run an internet poll you will be inundated with comments about selection bias and sample sizes to no end.

There is no actual difference but there is a massive difference in reception/perception.

Expand full comment

My comment says explicitly "I'll start off by saying I think most psych studies are absolute garbage and aella's is no worse."

Just because there's high prestige in publishing garbage doesn't mean truth-seeking people should signal boost bad methodological results. As I said, this sounds like in-group protection, not truth-seeking.

Expand full comment

We don't know what you may be rolling into the word "most".

Imagine data X is published by aella on twitter, and Y is published by the chair of cognitive science at Harvard in the American Journal of Psychiatry. Most people, it's fair to say, will automatically give a bit more credit to the latter. If that's you, that's understandable.

But the *reason* for doing that should never be given as "aella has selection bias" - they will both have some amount of that. If the reason is going to be "one has MORE selection bias", then that should be demonstrated with reference to the sample both parties used.

The actual reason we give less credit to aella is likely to be about the fact that some sources of information are more "prestigious" than others; whether or not that information tells you anything about reality is, often, irrelevant. This creates a bias in all of us.

Expand full comment

You can't just simultaneously make a demand for greater rigor, and also ensconce internet polling behind a veil of unfalsifiability. It seems like you're asserting that we should treat these things equally until given a substantive reason for doing otherwise.

If the prestige and the selection bias are both unknown quantities and it's either impossible or impractical to establish quality and magnitude of effect, then why push back against the skepticism?

This article would be more truthful if it made the simple observation that data, as it is, generalizes rather poorly.

Expand full comment

I don't think you fairly described the point of the person you are responding to.

Expand full comment

That may be true. It's a bit difficult to discern the underlying arguments since the disagreement I'm responding to is happening on a higher level. I'm not trying to be disputatious for the sake of it, but it seems like there's something contradictory about the way the discussion is being framed.

Expand full comment

It's not just bad methodology, it's antithetical to psychology, it doesn't study psyche as an absolute but hand waves around the topic and commits itself to utilitarianism. It makes what random groups say about themselves provisionally universal (but not really, haha, we're too postmodern for that....). It's not psychology at all, and it should out itself as demographic data in every headline and every summary so that people can see who and why the illusion of generalization is being made in the name of science. This is Scott's worst take ever!

Expand full comment

...what? What should out itself as demographic data, and what is it being perceived as instead, and what illusion of generalization is being committed by whom?

Expand full comment

Journals vary significantly in their publishing standards, up to and including pay to play ones existing, but every standard psych undergrad education involves teaching students to think about selection bias in psych studies run on people willing to participate in psych studies at a college. This is as psych 101 of an observation as it gets. People in leadership positions at journals think about this too, and it's too cynical to assert they'll publish anything if the authors just fiddle with the p-values the right way. That's not really true once we separate out that the term "journals" includes everything from reputable high-quality journals to predatory fly-by-night operations.

Expand full comment

>but every standard psych undergrad education involves teaching students to think about selection bias in psych studies run on people willing to participate in psych studies at a college. This is as psych 101 of an observation as it gets.

I would be more impressed with this if most of the research wasn't so garbage.

>People in leadership positions at journals think about this too, and it's too cynical to assert they'll publish anything if the authors just fiddle with the p-values the right way.

I think you mean publish anything if it hits their feelies the right way. Academia seems to have really hemorrhaged people actually interested in the truth in recent decades. And the high quality journals are nearly as bad.

Expand full comment

Yeah, I don't agree with your assertion that most published psychological research is garbage and publication standards are little more than the emotional bias of editors due to a loss of people who care about truth.

To the original point, psychology as a field is aware of types of bias that derive from convenience samples. Individual actors and organizations vary in how well they approach the problem, and it helps no one to flatten those distinctions.

Expand full comment

Meh, the field can win back my respect when it earns it. Too much Gell-Mann amnesia triggering all the time from most non-hard-science parts of academia (and even a little bit there) to really give it much faith these days.

You cannot be double checking everything you read, and so often when you do the papers are poorly thought out, poorly controlled, wildly overstating what they show, etc. And that is when they aren’t just naked attempts to justify political feelies regardless of what the facts are.

Academia had a huge amount of my respect from when I was say 10-20, to the extent I thought it was the main thing in society working and worth aspiring to.

Since then I have pretty much been consistently disappointed in it, and the more I look under the hood the more it looks like the Emperor is substantially naked.

Still much better than 50/50, but that isn't the standard I thought we were aiming for…

Expand full comment

Sounds like you agree with Scott's point: that "real" surveys also have unrepresentative samples so it doesn't make sense to single out Aella for criticism.

Expand full comment

Most people do not have a good way to respond to the authors of "real" surveys in a public way.

Expand full comment

It seems clear to me that Aella's audience vs Aella's questions are massively more intertwined, than psych students vs typical psychology questions. Both have some issues, but no question I'd trust her studies less.

Expand full comment

This is fair. For me the big focus of a piece like this would be less on defending her/twitter polls and more on dragging a lot of published research to just above "twitter poll" level.

Expand full comment

"In particular, aella's studies are often related to extremely sensitive topics like sex, gender, wealth, etc."

I think Scott's point about correlations still applies. For example, if she tries to look at 'how many times a month do men in different age bins have sex', it doesn't particularly matter that her n=6000 are self-selected for the kind of man who ends up a twitter follower of a vaguely rationalist libertine girl. The absolutes may (for whatever reason; maybe her men are hornier, maybe they're lonelier) be distinct from those of the general population, but she can draw perfectly valid (by the academic standard of 'valid') conclusions about trends.

She does seem to have the advantage of larger sample sizes than many studies. And if she were writing a paper, she'd list appropriate caveats about her sample anyway, like everyone else does.

Expand full comment

She makes claims of the form "more reliable" (https://twitter.com/Aella_Girl/status/1607482972474863616). Most people would interpret this as "generalizes to the whole population better." I simply don't think this is true.

Her polls do certainly have larger sample sizes, but it doesn't matter how low the variance is if the bias is high enough.

Expand full comment

Re: “more reliable”, I wouldn’t interpret that as “generalizing to the whole population better”. One possible goal is to get results that generalize over the whole population, but that is far from a universal desideratum. Many interesting things can be learned about subsets of the population other than “everyone”!

For example, if I’m interested in what people in my local community think of different social norms, attempting to get more representativeness of the US population as a whole would actively make my data worse for my purpose.

Expand full comment

I think that's fair. It wasn't a very precise statement, but that claim comes across as overbold.

"No less reliable" on the other hand...

Expand full comment

There are two forms of reliability, sometimes called "internal validity" ("is this likely to be a real effect?") and "external validity" ("is this effect likely to generalise to other settings?") They're often in tension: running an experiment under artificial lab conditions makes it easier to control for confounders, increasing internal validity, but the artificiality reduces external validity. Aella's large sample sizes give her surveys better internal validity than most psychology studies (it's more likely that any effect she finds is true of her population); it's unclear whether they have better or worse external validity.

Expand full comment

It’s not at all clear to me that the binning would work accurately. What if Aella has a good sampling of young men, but only lonely old men without partners follow her? Then that one cohort would be off relative to the others. You can come up with many hypotheticals along these lines pretty easily.

I think Aella and Scott’s surveys are great work and very interesting, but I think it’s also important to keep in the back of your mind that they could have even more extreme sample issues than your average half-baked psych 101 study.

Expand full comment

This is not correct. Aella's blog is selecting for "horny people," or at least "people who like reading and thinking about sex."

If old men in the real world are less horny than young men, they'll be less likely to read Aella's blog. As a result, Aella's going to end up drawing her young men from (say) the horniest 50% of young men, and her old man sample from (say) the horniest 10% of old men. As a result, she'd likely find a much smaller drop-off in sexual frequency than exists in the real world (if sexual frequency is positively correlated with horniness).
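
A minimal simulation of that mechanism, with every number invented for illustration: a latent "horniness" score that declines with age drives both sexual frequency and the chance of ending up in the sample, and the old are selected much more heavily on it than the young.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: horniness declines with age, frequency tracks horniness.
age = rng.uniform(20, 70, n)
horniness = 10 - 0.1 * age + rng.normal(0, 2, n)
freq = np.clip(horniness + rng.normal(0, 1, n), 0, None)  # times per month

young, old = age < 35, age > 55

# Readership is selected on horniness: say the horniest half of the young
# but only the horniest tenth of the old follow the blog.
reader = (young & (horniness > np.quantile(horniness[young], 0.5))) | \
         (old & (horniness > np.quantile(horniness[old], 0.9)))

print("Whole population, young vs old:",
      round(freq[young].mean(), 2), round(freq[old].mean(), 2))
print("Reader sample,    young vs old:",
      round(freq[young & reader].mean(), 2), round(freq[old & reader].mean(), 2))
# The young-vs-old gap is much smaller in the reader sample,
# understating the real age-related drop-off.
```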

Expand full comment

Yes, although that last if is a big one. Are they hornier and thus have more sex, or do they have less sex, making them hornier?

Expand full comment

Right--the bias could go either way! Which makes the whole thing a giant shitshow, because your relationship might be upwardly or downwardly biased by selection and it's hard to say how big of a bias you're likely to have.

What you'd probably want to do if you were doing a careful study with this sort of sample would be to characterize selection as much as possible. How underrepresented are old men in your sample relative to samples that aren't selected on horniness (i.e. maybe compare Aella's readership to the readership of a similar blog that doesn't talk about sex). How does the overall rate of sexual activity of your readers compare to some benchmark, like published national surveys? If your readers have a lot less sex, then you're probably selecting for lonely people and should expect the old men to be lonelier, on average. If your readers have more sex, it's the opposite.

But all of this is a problem when asking sex questions to a sex readership that you wouldn't have if you asked sex questions to the readers of a blog about birdwatching or something. They're not "representative" either, but they aren't in or out of your sample on the basis of their sexual attitudes.

Expand full comment

I haven't looked at Aella's survey in detail, but I presume it asks a number of demographic questions of its respondents which allows her to do basic adjustments (e.g. for relationship status). As I understand it, the quarrel isn't with her statistical methods, it's with the quality of her sample.

And the very fact that we cannot agree on a direction of bias leads me to question the likelihood that bias introduced specifically by 'aella follower' is non-uniform across age or any other given attribute. One could just as well presume that birdwatching as a hobby correlates negatively with getting laid in youth (nerds!) and positively in old age (spry, outdoorsy, quirky) - but you'd have to at least claim a mechanism in both cases, and have something to show for it empirically. And it would be interesting if you did. And, naturally, we see that kind of back and forth in ordinary research and it's the stuff knowledge is made of.

Obviously calibration and characterisation are good, but then they're always good. So is more data, especially if all unusual qualities of the sample are clearly and honestly demarcated.

Expand full comment

I don't know more about aella's survey than has been discussed here, but I agree that the question is sample quality.

The reason we can't agree on the direction of bias is because we don't know how horniness/interest in internet sexual content is related to sexual frequency. We're not disagreeing that "horniness" is likely related to age (indeed, this is the research question!) And we're not disagreeing that reading aella's blog is related to horniness. It's possible that there are a bunch of disparate relationships that cancel each other out, but the fact that blog readership is selected on a characteristic closely related to the research question makes the existence of bias quite likely, imo.

I agree with you that a bird watching sample could also have problems! Any descriptive research should think about and acknowledge the limitations of the data. That said, I'd believe a finding on age and sexual frequency that came from a bird watcher sample more than one that came from a sex-blog-reader sample. Let's say that bird watchers are awkward nerds who are physically fit enough to spend time in the woods. If nerds have less sex (debatable!), you'd expect all bird watchers to be less sexually active than non bird watchers of the same age. That's not a problem for internal validity--a bird watcher sample could still tell you how sexual frequency changes with age among nerds.

If physically fit people have more sex, you'd expect that among the general population, sexual frequency would drop off as people got older and less fit. If less fit people also stop birdwatching, you wouldn't see this (or wouldn't see it as much) in a birdwatching sample. That might be a big problem if you want to know how sexual frequency changes from 50s -80s, but probably isn't as big of an issue if you want to see how it changes from 20s-40s.

Expand full comment

Maybe, maybe not.

Among the general US population, height is weakly but positively correlated with most basketball skills (since tall people are more likely to play basketball). Among NBA players though, height is negatively correlated with basketball skills, since a six foot guy needs to be really really skilful to compete against seven-footers.

It could well be that Aella's readership is like the NBA of horniness.
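
A toy simulation of that kind of reversal, with made-up numbers: if making it into the selected group depends on the sum of two traits, a correlation that is positive in the general population can flip negative inside the group (Berkson's paradox / collider bias).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# General population: height and skill weakly positively related.
height = rng.normal(0, 1, n)
skill = 0.2 * height + rng.normal(0, 1, n)

# Selection into the "pro" group requires height + skill above a high bar.
pro = (height + skill) > 2.5

def corr(x, y):
    return round(float(np.corrcoef(x, y)[0, 1]), 2)

print("Population corr(height, skill):", corr(height, skill))            # ~ +0.2
print("Pro-group corr(height, skill): ", corr(height[pro], skill[pro]))  # negative
# Conditioning on "made the cut" flips the sign: among pros,
# the shorter players tend to be the more skilled ones.
```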

Expand full comment

I can't argue against that, really. I'd only say that you'd have to show a mechanism for that kind of thing before assuming it exists.

Expand full comment

I agree with this, and will add that (afaict) the majority of Aella's Twitter polls are, well, polls, akin to "how many people oppose abortion?" – the situation where Scott says selection bias is "disastrous", "fatal", etc.

Sure, some of them ask for two dimensions, so you could use them to measure a correlation, as long as you ignore all the things that could go wrong there (conditioning on a collider, etc.). But that's a fraction.

Expand full comment

I am also confused. This feels like "Beware the man of one study (unless it's Aella, then it's fine)".

If you're citing any single Psychology study, without demonstrating replication throughout the literature, you either haven't done enough research or you have an ideological axe to grind. Part of replication is sampling different populations and making sure that, at the very least, X result isn't just true for college students at Y university. Hopefully, it extends much further than that.

I don't see how Aella's data is any different than a single psychology study of a university called "Horny Rationalists U".

If what Scott is saying is that Aella's data is a good starting point and maybe worth doing some research into -- yeah, absolutely! (I feel like it won't surprise you to know that there is already a lot of psychological research, some which replicates some which doesn't, on Aella's topics).

Otherwise, I can't help but agree that this feels like in-group protectionism.

Expand full comment

> If what Scott is saying is that Aella's data is a good starting point and maybe worth doing some research into -- yeah, absolutely!

I think Scott would say that about any study, including Aella’s (except maybe “worth doing _more_ research into” because Aella’s studies are themselves research).

Expand full comment

*ARE* they research?

If, as some have said, she just asks one question of her audience, then the only thing I can think that they might be researching is "What would make my site more popular?". You need multiple questions to even begin to analyze what the answers mean.

Expand full comment
Dec 28, 2022 (edited)

Nah, it's "Beware Isolated Demands For Rigor".

Some people say this is some new weird take by Scott (and/or imply that he's just unprincipledly defending ingroup), but it's really not.

Expand full comment

I would expect both the marriage and divorce rate among college freshmen to be very low. :)

Expand full comment

I worked tangentially with an IRB board at a large public university a few years back. I wasn’t actually on the board, but I worked to educate incoming scientists, students, and the community about expectations and standards for human-subject research.

In that role, I sat in a lot of IRB review meetings. Our board talked extensively about recruitment methods on every single one of those proposals. And, because we always brought in the primary researcher to talk about the project and any suggested changes, we often sent them back for revision when the board had objections about how well-represented the population groups were.

Now, there were some confounding factors that the board took into account when determining whether the selection criteria for participants needed reworking:

1) The level of “invasiveness” of the research. How risky, sensitive is the required participation?

2) The potential communal rewards of the research. Are the risks of the research worthwhile?

3) The degree to which the scientists on the board could see problems with the stated hypothesis being more general than the proposed population would support.

The reason the risks and rewards were relevant to the board was because, if the risks were lower, the selection criteria standards could be lower. Likewise, if the potential rewards to the community were higher and the risks were low, the selection criteria would also be less stringent. But anything with high risk or low potential benefit got run through the absolute wringer if they tried some version of the things you wrote out above: “The real studies by professional scientists usually use Psych 101 students at the professional scientists’ university. Or sometimes they will put up a flyer on a bulletin board in town, saying “Earn $10 By Participating In A Study!”” Perhaps things are run differently elsewhere, but that kind of thing definitely did not pass my university's IRB.

All that to say, psych studies (generally speaking) were considered by the IRB to be quite low on the invasiveness scale, but were considered of good potential value to the community, so the selection standards were…not high. While this might make for a bunch of interesting published results, I don’t know that you could, by default, argue that any of the results of the psych research at the school would translate outside of the communities the researchers were evaluating. Correlations between groups of people are super-valuable, of course, but if you're looking for more "scientific," quantifiable data to apply at huge scale, that's just not the place to find it.

That’s an important distinction, though. Because the board would have to ask the researchers to correct the ‘scale’ of their hypothesis on almost all the psych research proposals that I sat in on, as researchers were usually making grand, universal (or at least national) statements, but usually only testing very locally.

I think our IRB had the correct approach on this. And I think it’s one that others should use (including Scott). To be clear, I think this self-critique and examination is something that Scott does regularly, judging from his writing. I also think he’s aware of the “selection bias” in his polls, such that it is, in that he tries to spot easily identifiable ways in which his demographic’s results might not translate to a broader community. That doesn’t mean he always sees it accurately, but I've seen the attempts enough times to respect it.

However, from what I’ve seen, most researchers don’t have the instinct to try to find fault or limitations to their own research's relevance. Perhaps that’s just my experience of seeing so many first-time graduate researchers come through the IRB, though. I don’t have any idea who Aella is, so perhaps I’m missing something, but it seems to me that stating the possible limits of the applicability of your research should be pro forma, and I'd be very hesitant to play down the importance of that responsibility.

But I think I get where Scott is coming from. He's right that selection bias is a part of research that you can’t get rid of, and a lot of the established players seem to get away with it while "amateurs" get dismissed completely because of it. But I think it’s something that you should definitely keep in mind and try to account for in both your hypothesis and your results.

To steel-man Scott's point, I don't think he's arguing that selection bias can't be a real problem (even at his most defensive, he just says it "can be" "fine-ish"). I think he's just trying to argue against using “selection bias” and “small sample size” as conversation-enders. Doesn't mean he's discounting them as real factors. But some people use them in a similar way that I often see people shout “that’s a straw man” or “that’s a slippery slope fallacy” online—to not have to think any further about the idea behind sometimes flawed reasoning. Those dismissive people often aren’t wrong in terms of identifying a potential problem, but they ARE wrong to simply dismiss/ignore the point of view being expressed because the argument used was poor. If I ignored every good idea after having heard it expressed/defended poorly, I wouldn't believe in anything.

In the same way, there's no reason to completely dismiss the value of any online poll, as long as you (the reader of a poll) have the correct limits on the hypothesis and don't believe any overstating of results.

Expand full comment

Well said.

Expand full comment

"However, from what I’ve seen, most researchers don’t have the instinct to try to find fault or limitations to their own research's relevance."

Alternatively, when applying for grants or trying to start a new big thing, researchers are often very much encouraged by grant committees to oversell their projects, which does not encourage openly discussing the projects' limitations.

Expand full comment

Ding ding ding. You make everything a race, you end up with people focused on speed and not safety, even if you claim to be very very concerned about safety. Especially if there are few penalties for mess ups.

Expand full comment

There's nothing wrong with giving your blue-sky hopes when pitching the grant application. Why not? The program manager definitely wants to know what the best-case outcome might be, because research is *supposed* to be bread on the waters, taking a big risk for a potential big outcome.

I think the discussion here is what goes into your paper reporting on the work afterward -- a very different story, where precision and rigor and not running ahead of your data are (or ought to be) de rigueur.

Now, if you are pitching the *next* grant application in your *current* paper, that's on you, that's a weakness in your ethics. Yes, I'm aware there's pressure to do so. No, that isn't the slightest bit of excuse. Withstanding that pressure is part of the necessary qualifications for being entrusted with the public's money.

Expand full comment

This seems a really naïve interpretation of how the process actually works and what the incentives are.

Expand full comment
Dec 28, 2022 (edited)

It's certainly not naive, since I've been involved in it, on both ends, for decades. You may reasonably complain that it's pretty darn strict, but that doesn't bother me at all.

Being a scientist is a sweet gig, a rare lucky privilege, something that any regular schmo trying to sell cars, or cut an acre of grass in 100F Houston heat, or unfuck a stamping machine on an assembly line that just quit would give his eyeteeth to be able to do -- sit in a nice air-conditioned office all day, speculate about Big Things, write papers, travel the world to argue with other smart people about Big Things.

If you're independently wealthy and can afford to theorize about the infinite, then do whatever you damn well please. But if you are doing this on the public's dime -- on money sweated out of the car salesman, the gardener, or the machine shop foreman -- then yeah there are some pretty strict standards. If you don't like it, check out the Help Wanted ads in your hometown and do something for which someone will pay you voluntarily.

Expand full comment

My point would be that people very rarely live up to those standards, in large part because the way science is done has the incentive structure all wrong. The naïveté I was talking about was your belief that the system was working very well.

Expand full comment

If Scott had a list of work Aella has done related to something like banana eating I'd take his argument much more seriously. Using such a specific example seems like a bad idea in general for the purpose of this post and it seems like an even *worse* idea to use the specific specific example he used.

I honestly wonder if there is some sort of weird trick question/fakeout going on with this whole post just because of that.

Expand full comment

There are three main points to consider that I think you're glossing over.

1. This only matters if aella is *claiming* to represent a larger population than 'people who read my blog and people who are similar to them.' Having accurate data that only tells you about one thing rather than everything is not the same as having inaccurate data. You would have to read each blog post to judge whether or not the results are being misrepresented each time.

2. This only matters if the selection criteria and the thing being measured are correlated. You say that aella mostly asks about things relevant to their blog which implies there will be a correlation, which I'm sure is true for some things they measure, but won't be true for everything. Psychology studies are also not as stupid about this as people imagine, it is normal to just use psych students when you are studying low-level psychophysics concepts that should not correlate with college attendance, and to get a broader sample when studying social phenomena that will correlate.

3. This primarily matters if you are doing a descriptive survey of population counts and nothing else, and matters a lot less if you are looking at correlations between factors or building more complex models. For example, lets say that you were asking about sex positivity and lifetime number of partners; sure, you might plausibly imagine that both of those things are higher among aella's audience, so you wouldn't represent the simple counts as representative of the general population. But if you were asking what the *relationship* between those two factors is, there will still be variation within that sample that lets you tell the relationship, and there's no particular reason to believe that *how those factors vary together* is different in aella's audience than in the general population.
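
A small sketch of point 3, with all quantities invented: selection shifts the *levels* a lot, but if the selection variable doesn't change how the two measured variables move together, the estimated relationship within the sample stays close to the population one.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical traits with the same underlying slope for everyone.
openness = rng.normal(0, 1, n)                      # drives who reads the blog
positivity = openness + rng.normal(0, 1, n)         # "sex positivity"
partners = 2.0 * positivity + rng.normal(0, 2, n)   # lifetime partners (stylized)

readers = openness > 1.0                            # self-selected audience

def slope(x, y):
    return round(float(np.polyfit(x, y, 1)[0]), 2)  # OLS slope of y on x

print("Mean partners, population vs readers:",
      round(partners.mean(), 2), round(partners[readers].mean(), 2))
print("Slope partners~positivity, population vs readers:",
      slope(positivity, partners), slope(positivity[readers], partners[readers]))
# The levels differ a lot between samples, but the slope is ~2 in both,
# because selection (on openness) doesn't alter the positivity-partners link.
```

(If selection instead depended directly on the outcome being measured, as in point 2's worry, this protection would break down.)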

Expand full comment

I get tons of responses from people who have no idea who I am! People seem to not understand I do research that's not just twitter polls.

Expand full comment

What studies are good?

Expand full comment

If smart people eat bananas because they know they are good for their something something potassium then we should be skeptical about the causal language in your putative study title. Perhaps something more like "Study finds Higher IQ People Eat More Bananas" would be more amenable to asterisking caveats and less utterly and completely false and misleading.

Expand full comment
author

I realized someone would say this two seconds after making the post, so I edited in "(obviously there are many other problems with this study, like establishing causation - let’s ignore those for now)"

Expand full comment

Sorry for being predictable ;)

I think I was primed to have this concern because when I initially read "selection bias" my brain went right to "selection into treatment" (causality issues) rather than "sample selection bias".

Expand full comment

Same thought. Super surprising he did not say "Study finds positive correlation between IQ and banana consumption". I thought he was going to be funny and put both disclaimers in the asterisk.

Expand full comment

I think the real difference here is that the studies are doing hypothesis testing, while the surveys are trying to get more granular information.

I mean you have a theory that bananas -> potassium -> some mechanism -> higher IQ, and you want to check if it is right, so you ask yourself how does the world look different if it is right versus if it is wrong. And you conclude that if it is correct, then in almost any population you should see a modest correlation between banana consumption and IQ, whereas the null hypothesis would be little to no correlation. So if you check basically any population for correlation and find it, it is evidence (at least in the Bayesian sense) in favor of your underlying theory.

On the other hand, if you were trying to pin down the strength of the effect (in terms of IQ points/ banana/ year or something), then measuring a correlation for just psych 101 students really might not generalize well to the human population as a whole. In fact, you'd probably want to do a controlled study rather than a correlational one.

Expand full comment
Dec 27, 2022 (edited)

This is a much better explanation than Scott's post. Very helpful comment

Expand full comment

That would work very nice if the actual steps indeed were:

1. formulate hypothesis

2. randomly pick a sample

3. make measurements

4. check if the hypothesis holds for that sample

But I suspect that often it's: 2 is done first, then 3, then 1, then 4.

Then, I think it doesn't work.

And there might be some approaches along this spectrum where the hypothesis "shape" is determined first, but it has some "holes" to be filled later. The holes could range in size from "huge" like "${kind of fruit}" to "small" like "${size of effect}".

Expand full comment

I mean if you formulate your hypothesis only after gathering your data and aren't correcting properly for multiple hypothesis testing or aren't using separate hypothesis-gathering-data and hypothesis-testing-data, then you are already doing something very very wrong.

But I think that there are probably lots of things you might want to test for where whether or not the effect exists is relatively stable group to group, but details like the size of the effect might vary substantially.

Expand full comment

That way lies the "green jellybean" effect.

If you have a bunch of data points, you can always find an equation to produce that collection within acceptable error bounds. Epicycles *do* accurately predict planetary orbits. And with enough creativity I'm sure they could handle relativity's modification of Mercury's orbit.

Expand full comment

Isn't the green jellybean effect like 30% of modern "research"?

Expand full comment

This doesn't sound quite right to me. I would say the main difference is not the granularity, but that "correlation studies" look for x -> y, where both x and y are measured within persons, while polls look at x across different persons. Thus in the "correlation studies" the interesting thing happens (in a way) within each participant of the study; whether something they have is related to something else they have (not). Thus, who the participants are is less relevant, as they kinda work as a ""control"" for themselves.

In case it sounds like nitpicking, this difference is relevant in that my point leads to a different conclusion on the effect sizes. I see no reason to think that estimating effect sizes from psych students (generalized to the whole population) is more wrong than estimating the existence of an effect. The latter is really just a dichotomous simplification of the former ("it is more/less than 0"), and if we draw a conclusion that there is an effect, say, >0, we might as well try to be more nuanced of the size of it.

Because if you say that the effect size does not generalize, why would the sign of it (+ or - or 0) then generalize? Of course, it is more possible to be right when only saying yes or no, but there is no qualitative difference between trying to generalize an effect existing and trying to generalize the size of that effect. The uncertainty of the effect being -.05 or .05 is not really different from the uncertainty of the effect being .05 or .1

Expand full comment

I don't think that this holds up. You don't really have correlations within a single person unless you measure changes over time or something. Remember correlation is defined as:

([Average of X*Y] - [Average of X]*[Average of Y])/sqrt{([Average of X^2]-[Average of X]^2)([Average of Y^2]-[Average of Y]^2)}.

It's a big mess of averages and cannot be defined for an individual person. I suppose it is robust to certain kinds of differences between groups that you consider, but it is not robust to others.

My point is that if you are looking at a relatively big effect and want to know whether it is there or not, most groups that you look at will tell you it's there. However, if you want to know more accurately how big it is, sampling just from one group is almost certainly going to give you a biased answer.

Expand full comment

Yes the correlation is measured across the sample, but my point is that: the correlation is like testing whether BMI and running speed are correlated; and the poll-approach is like testing whether running speed is, say, on average more than 8km/h. The former usually works in a non-representative sample because it has two measurements for each participant, and is testing the relationship between them. The latter does not, as it tries to test some quality of the people as a whole.

I still claim that it has nothing to do with the granularity, you can answer both of those questions at different levels of exactness, but only one of them can give meaningful results on an unrepresentative sample.

Expand full comment

I think we're only seeing a difference in this example because the correlation between running speed and BMI is not close to 0 while the average running speed *is* close to 8km/h. If you were trying to test whether average running speed was more than 2km/h, you'd probably get pretty consistent answers independent of which group you measured.

Actually, measuring correlation over just a single group might even be less reliable because of Simpson's paradox. The correlation could be positive within every group and yet be negative overall.

Expand full comment

I agree that most people rush to "selection bias" too quickly as a trump card that invalidates any findings (up there with "correlation doesn't mean causation"). However, I disagree that "polls vs correlations" is the right lens to look through it (after all, polls are mostly only discovering correlations as well).

The problem is not the nature of the hypotheses or even the rigor of the research so much as whether the method by which the units were selected was itself correlated with the outcome of interest (i.e., selecting on the dependent variable). In those cases, correlations will often be illusory at best, or in the wrong direction at worst.

Expand full comment

I agree that "polls vs correlations" isn't right, but partly because I think "polls" are far more heterogeneous than Scott suggests. If you want to find out who's going to win an election, then you really care about which side of 50% the numbers are on. But if you're polling about something like "how many people support marijuana legalization?" then your question is more like "is it 20% or 50% or 80%?" and for this, something that has a good chance of being off by 10 is fine (as long as you understand that's what's going on).

Expand full comment

But off-by-10% is not that bad a scenario, real life can be worse. You can be off by much much more. For example, by polling the much-mentioned psych students, I know from results I've seen that you can get >40% support for a political party that is at 10% for the overall population. So the main difference is not the hypothesis, the main difference is whether you are looking at population-level descriptive statistics or a link between two variables within people ("liberals like cheesecake more than conservatives").

Expand full comment

What do you all think about the dominance of Amazon’s Mechanical Turk in finding people for studies? Has it worsened studies by only drawing from the same pool over and over?

Expand full comment

Seems probably better than just using college students tbh - you get the same pool of students volunteering over and over for studies as well, and they are more demographic-restricted.

(Of course there is natural turnover in that most college students are only college students for 4-5 years - I wonder how that compares to the turnover in MTurk workers)

Expand full comment

There needs to be a rule that you can only volunteer for one paid psychology experiment in your lifetime.

I did a bunch of these when I was in school and you quickly realize that the researchers are almost always trying to trick you about something. It becomes a game to figure out what they're lying about and what hypothesis they're testing, and in most cases that self-awareness will ruin the experiment.

Expand full comment

That is not true of most psychology experiments. For a meaningful portion of them (say 20% or something), sure, but definitely not most. But sure, this self-awareness can be a big problem (although I have participated in multiple studies which included cheating, and I don't think it affected me in any of them, maybe I'm just a gullible person).

Expand full comment
Dec 27, 2022 (edited)

"Selection bias is fine-ish if..."

I'm interpreting this as saying that one's prior on a correlation not holding for the general population should be fairly low. But it seems like a correlation being interesting enough to hear about should be a lot of evidence in favour of the correlation not holding, because if the correlation holds, it's more likely (idk by how much, but I think by enough) to be widely known -> a lot less interesting, so you don't hear about it.

As an example, I run a survey on my blog, Ex-Translocated, with a thousand readers, a significant portion of which come from the rationality community. I have 9 innocuous correlations I'm measuring which give me exactly the information that common sense would expect, and one correlation between "how much time have you spent consuming self-help resources?" and "how much have self-help resources helped you at task X?" which is way higher than what common sense would naively expect. The rest of my correlations are boring and nobody hears about them except for my 1,000 readers, but my last correlation goes viral on pseudoscience Twitter that assumes this generalises to all self-help when it doesn't and uses it to justify actually unhelpful self-help. (If you feel the desire to nitpick this example you can probably generate another.)

I agree that this doesn't mean one ought to dismiss every such correlation out of hand, but I feel like this does mean that if I hear about an interesting survey result's or psych study's correlation in a context where I didn't also previously hear about the survey/study's intention to investigate said correlation (this doesn't just require preregistration because of memetic selection effects), I should ignore it unless I know enough to speculate as to the actual causal mechanisms behind that correlation.

This pretty much just bottoms out in "either trust domain experts or investigate every result of a survey/every study in the literature" which seems about right to me. So when someone e.g. criticises Aella for trying to run a survey at all to figure things out, that's silly, but it's also true that if one of Aella's tweets talking about an interesting result goes viral, they should ignore it, and this does seem like the actual response of most people to crazy-sounding effects; if anything, people seem to take psych studies too seriously rather than not taking random internet survey results seriously enough.

Expand full comment

It's a good point. But it seems this applies equally to psych studies, so it doesn't weaken Scott's point that we shouldn't single out internet surveys as invalid.

Expand full comment

Like any kind of bias, selection bias matters when the selection process is correlated with BOTH the independent and dependent variables and as such represents a potential confounder. Study design is how you stop selection bias from making your study meaningless.

Expand full comment

The way I think about the key difference here (which I learned during some time doing pharma research, where these kinds of issues are as bad as... well) is that when claiming that a correlation doesn't generalize, some of the *burden of proof* shifts to the person criticizing the result. Decent article reviewers were pretty good at this: giving an at least plausible-sounding mechanism by which, when going to a different population, there's some *additional* effect to cancel/revert the correlation. It's the fact that the failure of correlation requires this extra mechanism that goes against Occam's Razor.

Expand full comment

It's not about correlations, it's about the supposed causal mechanism. Your Psych 101 sample is fine if you are dealing with cognitive factors that you suppose are universal. If you're dealing with social or motivational ones, then you're perhaps going to be in danger of making a false generalization. This is particularly disastrous in educational contexts because of the wide variety of places and populations involved in school learning. It really does happen all the time, and the only solution is for researchers to really know the gamut of contexts (so that they realize how universal their mechanisms are likely to be) and make the context explicit and clear instead of burying it in limitations (so that others have a chance of catching them on an over-generalization, if there is one). Another necessary shift is for people to simply stop looking for universal effects in social sciences and instead expect heterogeneity.

Expand full comment

“But real studies by professional scientists don’t have selection bias, because . . . sorry, I don’t know how their model would end this sentence.”

...because they control for demographics, is how they’d complete the sentence.

Generically, we know internet surveys are terrible for voting behavior. Whether they’re good for the kinds of things Aella uses them for is a good question!

I’m on the record in talks as saying “everything is a demand effect, and that’s OK.” I see surveys as eliciting not what a person thinks or feels, but what they are willing to say they think and feel in a context constructed by the survey. Aella is probably getting better answers about sexual desire (that’s her job, after all!) and better answers on basic cognition. Probably worse on consumer behavior, politics, and generic interpersonal.

Expand full comment
author

Here is a randomly selected study from a top psychiatry journal, can you explain in what sense they are "controlling for demographics"? They have some discussion of age but don't even mention race, social class, etc.

https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.2020.19080886

Expand full comment

That’s an RCT, not a survey, and it’s probably more useful for them to run with the selection effects that get people into the office, rather than attempt to determine what would work for an unbiased sample of the population at large.

Expand full comment
author

I'm not sure why the RCT vs. survey matters for this purpose. Randomization only guarantees that people don't have extra confounders aside from the ones that brought them into the study, it doesn't address selection bias in getting into the study itself.

If I did an RCT of ACX readers, where I artificially manipulated the mental health of one group (by giving them addictive drugs, say), and then tested which group had more spiritual experiences, they would still be ACX readers, different from the population in all the usual ways.

Nor does it have to do with "getting them into the office". You find the exact same thing in nonclinical psychology studies - for example, can you find any attempt to control for demographics or extend out of sample in https://asset-pdf.scinapse.io/prod/2001019597/2001019597.pdf (randomly selected study that appeared when I Googled "implicit association test", a randomly selected psych construct that came to mind).

Maybe it would be more helpful if you posted a standard, well-known psychology paper that *did* do the extension out of sample as a routine part of testing a psychological construct. I think I've never seen this and would be interested to know what you're thinking of.

Expand full comment

I’m kind of going study by study as you post them. The RCT you suggest would be great if you were proposing a treatment for ACX readers, but I think you’d agree that results would be much weaker if you were proposing to extend to a large inner city hospital. Correlations in one population can reverse in another, famously from collider bias!

An example from Aella: I would not be surprised if Aella finds a negative correlation between kinks that is positive in the general population. This would be driven by a non-linearity; mildly kinky people are into lots of things—so positive correlation between kinks in the general population. Very kinky people have an obsession that excludes others (so a weak or negative correlation in a survey that selects for people by interest in a sex worker's feed).

In an academic psych talk, a query about selection bias might receive the response "we corrected for demographics", or it might get the response "what confound do you have in mind?" In political science I think the demographic question is more salient, because they have a (rather imaginary) "polis" in mind.

IMO the most interesting questions involve an interaction between group and individual, so the most interesting work is asking "what happens for people in group X", and talks about both the cognitive mechanisms and how they interact with the logic of the group.

Expand full comment
author
Dec 28, 2022·edited Dec 28, 2022Author

Sorry, I do want to stick with the original thing we were discussing, which is whether most real scientists control for demographics. As far as I can tell, this still seems false. Do you still believe it is true? If so, can you give me examples? If we now agree this is false I'm happy to move on to these other unrelated points.

Expand full comment

I'm a psychology PhD and in my experience at least ~1/4 of the time reviewers will ask you to control for basic demographics like gender/race/age. (And lots of papers will throw this in there as a supplementary analysis.) If this seems incongruent with other people's experience, I can pull up a few random papers & see if I'm right...

And I agree with Simon that "did you control for gender/race/etc?" or "does this differ across genders/etc?" are common questions in talks.

Expand full comment

No apology necessary! You’re asking a science-of-science question which is best settled not by our anecdotes, but a quantitative survey of the literature. My suggestion, ironically enough, is to control, if you do such a survey, for selection bias.

Expand full comment

If you did the RCT of ACX readers and you found that giving addictive drugs led to more spiritual experiences, you could be pretty confident that it was true that the type of people who read ACX are more likely to have mystical experiences if they take addictive drugs. You'd then have to wonder whether this is only true for the type of people who read ACX, but you'd at least have solid ground from which to generalize.

Without the RCT, you might find a relationship between drug use and spiritual experiences because drug use and spiritual experiences influence readership of your blog. In that case, your correlation might not even hold for "the type of people who read ACX"--just for actual ACX readers.

You're a psychiatrist who writes about drugs, and you're also an irreligious rationalist. Let's suppose that your irreligious rationalism makes you less appealing to people who are more likely to be spiritual, but your drug writing makes you more appealing to people who are more into drugs. Together, that might mean that spiritual experience people who don't like drugs tend to stop reading your blog, but spiritual experience people who like drugs stick around, while non-spiritual people stick around regardless of whether they like drugs.

If so, you'd end up finding a positive correlation between drugs and spiritual experiences when looking at blog readers, even if that correlation doesn't exist for people with low interest in spirituality (as a group), or for people with a high interest in drugs (as a group), or people with a high IQ, or any other way of characterizing your readership. The correlation would be entirely an artifact of the selection process into becoming a blog reader.
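For concreteness, here's a toy version of exactly that retention story (all probabilities invented): the two traits are independent in the population, spiritual-but-not-drug-liking people mostly drift away, and a positive correlation appears among the readers who remain.

```python
# Toy model: independent traits in the population, differential attrition among
# readers, spurious positive correlation among remaining readers.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

likes_drugs = rng.random(n) < 0.3
spiritual   = rng.random(n) < 0.3            # independent of likes_drugs

stay_prob = np.full(n, 0.8)
stay_prob[spiritual & ~likes_drugs] = 0.2    # spiritual non-drug people mostly leave
stays = rng.random(n) < stay_prob

def phi(a, b):
    """Correlation between two binary variables (phi coefficient)."""
    return np.corrcoef(a, b)[0, 1]

print("population:", round(phi(likes_drugs, spiritual), 3))                   # ~0.00
print("readers:   ", round(phi(likes_drugs[stays], spiritual[stays]), 3))     # clearly > 0
```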

Expand full comment
Dec 28, 2022·edited Dec 28, 2022

I think a big part of the argument is "what is a spiritual/mystic experience?"

Tweeter didn't define it in that tweet, and that does make a difference.

I have had profound feelings of awe and gratitude at the beauty of the universe, but I would not define those as "mystic". Now, if Tweeter does say "but that is a mystic experience!", then we can begin to arrive at some kind of definition: a 'healthy' mind will have feelings of 'more than the usual grind or the rat-race'.

So everyone who has had the oceanic feeling can agree that they have had it, whether or not you define that as spiritual/mystic, and *then* we can ask "so how is your mental health?" and correlate one with the other.

If "excellent mental health" and "regularly experience of profundity" go together significantly, then Tweeter has made their case.

As it is, it's just more "Eat Pray Love" tourism showing-off about being *so* much finer material than the common clay normies.

The Hopkins poem "God's Grandeur" speaks to me, but that is because (1) we're co-religionists so I get where he's coming from and (2) I too have had experiences of the beauty of the world, despite all the evil and pain and suffering, but I would not necessarily call those mystic or spiritual:

https://www.poetryfoundation.org/poems/44395/gods-grandeur

Expand full comment

Nope, that matters enormously.

RCT vs. survey does make a critical difference as soon as you are secretly thinking about some causal nexus that the correlation might somehow suggest, even if not quite imply, which is to say basically always.

With an extra assumption that the causal mechanisms work the same for the entire population, the RCT on the unrepresentative sample is actually good evidence for some kind of causal connection in the underlying population. And then of course you get to speculate on direction of causality, common causes, multicausality and so on.

On the other hand, *even with such an assumption*, a sample unrepresentative in one of the correlated variables will change correlations enough to make them basically non-evidence for causal stories in the underlying population, rather than the already weak evidence they would be in a representative sample. So "convenience sample" is a fairly weak objection to an RCT and a pretty fatal one to a correlational survey.

(That said, of course you will also find correlational studies on convenience samples in the literature. In that case I think the conclusion should not be phrased as "Aella polls as good as science" but rather as "Lots of 'science' is no better than Aella polls, which is part of why there's a replication crisis". Logically the same, but it correctly identifies which side of the false distinction people are wrong about.)
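To make the RCT-vs-correlational-survey point concrete, here's a toy simulation (all numbers invented, assuming a homogeneous treatment effect as above): on the same unrepresentative sample, randomizing the treatment recovers the true causal effect, while the observational comparison in that sample is badly biased.

```python
# Same biased convenience sample, two analyses: randomized treatment vs.
# observational comparison where treatment uptake depends on the same trait
# that drives both selection and the outcome.
import numpy as np

rng = np.random.default_rng(2)
n = 300_000

trait = rng.normal(size=n)                       # something readers are selected on
outcome_base = 0.5 * trait + rng.normal(size=n)  # the trait also drives the outcome
true_effect = 1.0                                # homogeneous causal effect, by assumption

# Convenience sample: high-trait people are far more likely to show up.
in_sample = rng.random(n) < 1 / (1 + np.exp(-2 * trait))

# Observational: people self-select into "treatment" based on the trait.
self_treated = rng.random(n) < 1 / (1 + np.exp(-2 * trait))
obs_outcome = outcome_base + true_effect * self_treated
obs_est = (obs_outcome[in_sample & self_treated].mean()
           - obs_outcome[in_sample & ~self_treated].mean())

# RCT: flip a fair coin inside the convenience sample.
rct_treated = rng.random(n) < 0.5
rct_outcome = outcome_base + true_effect * rct_treated
rct_est = (rct_outcome[in_sample & rct_treated].mean()
           - rct_outcome[in_sample & ~rct_treated].mean())

print("true effect:           ", true_effect)
print("observational estimate:", round(obs_est, 2))  # inflated by the trait confound
print("RCT estimate:          ", round(rct_est, 2))  # close to 1.0
```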

Expand full comment

>And then generalize further to the entire world population over all of human history, and it stops holding again, because most people are cavemen who eat grubs and use shells for money, and having more shells doesn’t make it any easier to find grubs.

I know this is somewhat tongue-in-cheek, but for accuracy's sake: the number of people who were born before widespread adoption of agriculture was on the order of 10 billion, vs. about 100 billion after. https://www.prb.org/articles/how-many-people-have-ever-lived-on-earth/

Expand full comment

Ah but you are only considering the past and present, not the whole of human history.

Expand full comment

If the future of humanity involves a total collapse of civilization including the loss of all agricultural capacity such that it's no longer possible to exchange wealth or non-food labour for food, then the human population will rapidly collapse to match the Earth's carrying capacity for hunter-gatherers.

Even under the generous assumption that the carrying capacity after this unspecified apocalypse is not lower than it was in the past, it would take on the order of a million years of hunter-gatherer life without anyone reinventing agriculture for the cumulative hunter-gatherer population to match the cumulative agricultural-industrial population.

Expand full comment

What if humanity ends up colonizing other planets? Assume a scenario where interstellar space travel is 1. possible, 2. extraordinarily expensive and time-consuming, to the point where it's always a one-way trip and there's no transportation or communication between human worlds. Many of these worlds could end up reverting to primitivism, especially if the resources necessary to develop modern technological devices are scarce there. In that situation, each individual world's population would be much lower than modern-day Earth, but the overall population of humans scattered across the galaxy would be far higher than Earth's population alone.

This is an extraordinarily unlikely and ridiculously contrived scenario, I know. But it's one way that Melvin's statement could end up being accurate, and I've seen plenty of sci-fi universes built on some variant of this premise.

Expand full comment

My prior on "there exist a large number of planets where humans can subsist in viable numbers as hunter-gatherers but cannot establish even rudimentary agriculture, and the total hunter-gatherer carrying capacity of these planets is higher than the total agricultural-industrial carrying capacity of all the planets suitable for agriculture" is...basically zero.

Like I don't even know what it would mean for a planet to be suitable for hunter-gatherer lifestyles but not for any kind of agriculture. There's a reliable, sustainable supply of nonpoisonous organic matter with adequate micronutrient content for humans, but you can't breed/cultivate it and you can't use it as a planting medium? What would that even look like?

Expand full comment

>Like I don't even know what it would mean for a planet to be suitable for hunter-gatherer lifestyles but not for any kind of agriculture.

There are plenty such regions on Earth. Permafrost and polar caps. Dense forests, especially jungle, without the means to clear it. Basically the opposite gradient of arable land:

https://en.wikipedia.org/wiki/Arable_land

Expand full comment

Polar caps are not suitable for agriculture, and they also aren't suitable for hunting-gathering. A population there will just die.

Permafrost is suitable for hunting-gathering, and it's not suitable for agriculture. But hunting-gathering isn't going to happen, because the permafrost is also suitable for pastoralism and the pastoralists will easily defeat the hunter-gatherers due to their vastly superior numbers. https://en.wikipedia.org/w/index.php?title=S%C3%A1mi_people

Dense jungle is suitable for hunting-gathering and for agriculture. There is no such thing as not having the means to clear it; doing so is well within the means of hunter-gatherers. There is only jungle that no one has yet bothered to clear.

Expand full comment

In addition to what Michael Watts said:

Those are regions, not planets. The 'single-biome planet' exists only in fiction; neither the physics nor the biology of it works in reality. Any real planet will have a gradient of climates and interdependent biomes. If the warmest of them is permafrost, it won't resemble Earth's arctic region, which is full of organisms that evolved in warmer conditions and then adapted to the cold. It also most likely won't have an atmospheric composition hospitable to humans.

Also, importantly, there are many kinds of agriculture in the broad sense I'm using the term that don't require arable land. Pastoralism often works where farming doesn't. There's also aquaculture (which would be perfectly viable in the Arctic), greenhouses, hydroponics, algae vat farms, and more. These aren't all *economically* viable at scale in a world with huge amounts of arable land, but it's important not to confuse "unprofitable" with "impossible."

Expand full comment

Maybe the people of the future are technologically advanced but they eat grubs because the government has outlawed all other foods for environmental reasons.

And they use shells as currency because repeated financial crises have shown both fiat and crypto currencies to be unreliable.

Expand full comment

In that scenario they'd have grub farms and grub-harvesting specialists who would accept shells (or some other currency) in exchange for grubs.

(My correction wasn't about the specifics of the food or the currency - I understand "grubs and shells" to be metonyms for all the foods and currencies that hunter-gatherers might use - but about the economic circumstances where someone wouldn't be able to exchange currency for food.)

Expand full comment

Unless food is severely rationed by the government and money is only useful for buying NFTs and new hats for your avatar.

Expand full comment

Severe food rationing by the government results in the government becoming food, and then everything stabilizes again.

Expand full comment

Unless the future involves the birth of -90 million humans (or 90 million anti-humans?), these numbers can only get worse for Scott's claim by including the future.

Expand full comment

Yeah but you also have to count the billions of people ordering paleo GrubHub with TurtleCoins.

Expand full comment

I am a professor of political science who does methodological research on the generalizability of online convenience samples. The gold standard of political science studies is indeed *random population samples* -- it's not the whole world, but it is the target population of American citizens. Yes this is getting harder and harder to do and yes imperfections creep in. But studies published in eg the august Public Opinion Quarterly are still qualitatively closer to "nationally representative" than are convenience samples, and Scott's flippancy here is I think a mistake.

My research is specifically about the limitations of MTurk (and other such online convenience samples) for questions related to digital media. My claim is that the mechanism of interest is "digital literacy" and that these samples are specifically biased to exclude low digital literacy people. That is, the people who can't figure out fake news on Facebook also can't figure out how to use MTurk, making MTurk samples almost uniquely bad for studying fake news.

(ungated studies: http://kmunger.github.io/pdfs/psrm.pdf

https://journals.sagepub.com/doi/full/10.1177/20531680211016968 )

This post is solid but it doesn't emphasize enough the crucial point: "If you’re right about the mechanism...". More generally, I think that there are good reasons that Scott's intuitions ('priors') about this are different from mine: medical mechanisms are less likely to be correlated with selection biases than are social scientific mechanisms.

There is a fundamental philosophy of science question at stake here. Can the study of a convenience sample *actually* test the mechanism of interest? As Scott says, there is always the possibility of eg collider bias (the relationship between family income and obesity "collides" in the sample of college students).

So how much evidence does a correlational convenience sample *actually* provide? This requires a qualitative call about "how good" the sample is for the mechanism at issue. And at that point, if we're making qualitative calls about our priors and about the "goodness" of the sample... can we really justify the quantitative rigor we're using in the study itself?

In other words: should a study of a given mechanism on a given convenience sample be "valid until proven otherwise"? Or "valid until hypothesized otherwise"? Or "Not valid until proven otherwise"? Or "Not valid until hypothesized otherwise"?

Expand full comment

For psych (as opposed to poli sci) you’re looking for reasonably robust mechanisms that can survive restriction to weird subpopulations. If you’re describing some cognitive bias X, you want it to be present even if you restrict only to (say) “high digital literacy Democrats”.

The debates are (in other words) really specific to field. Trying to explain voting behavior is really different from studying (say) risk perception.

Expand full comment

"Scott's flippancy here is I think a mistake."

That's what I came here to say. Glad someone smarter than me said it first.

Expand full comment

> and Scott's flippancy here is I think a mistake.

I was thinking that initially, but he did address that.

"b) hire a polling company like Gallup which has tried really hard to get a panel that includes the exact right number of Hispanic people and elderly people and homeless people and every other demographic"

Expand full comment

Is there a reason why you just wouldn't want to be somewhat specific with the headline of what you're publishing? So instead of "Study Finds Eating Bananas Raises IQ," you instead publish “Study Finds Eating Bananas Raises IQ in College Students," if they're all college students.

Expand full comment

You certainly can, and to a great degree good science involves judging where to place your title on the spectrum between "Eating Bananas Raises IQ for Everyone " and "Eating Bananas Raises IQ for Three Undergrads named Brianna and One Jocelyn." That said, one of the first things that often happen in pop science reporting is that these caveats get left out.

Expand full comment
author

Because "Study Finds Eating Bananas Raises IQ In College Students" is not, in fact, what you found. How do you know if a study in college freshman generalizes to college sophomores? If a study in Harvard students generalizes to Berkeley students? If a study in 2022 college students generalizes to 2025 college students.

You could title it "Study Finds Eating Bananas Raises IQ In This One Undergraduate Berkeley Seminar Of 80% White People, 20% Hispanic People, Who Make Between $50K and $100K Per Year, and [so on for several more pages of qualifications]", but the accepted way to avoid doing that is just to have a Methods section where you talk about the study population.

Expand full comment

There does seem to be a reasonable middle ground here, as @DaveOTN pointed out, at least in my experience. Lots of papers I've been reading have some degree of specification in their title, but not all of them, and then the Methods sections add even more specificity. Maybe this is more of a recent trend.

Expand full comment
Dec 28, 2022·edited Dec 28, 2022

Completely agreeing with what Scott already answered, but to flesh it out a bit more.

Sometimes you see (mostly old) papers titled something like "x does y in women", and what that title really does, at least to me, is raise the assumption that this is an effect somehow specific to women and not men. Now, I have seen enough of those studies to know that they almost never have tested that difference; they just happened to have only female participants.

But it still sounds completely odd. It is correct in a way, but I think the title of a study, or of anything really, is not something you are supposed to read completely literally. It tries to convey as much information in as few words as possible, and if you include something in it, the reader will assume it is of importance. And this is good. We need titles, and we need them not to state every detail of the study, otherwise they would be useless. So by adding a detail like that to your title, you are really communicating to the reader something other than just the fact of the study population.

So yes, I think there is a good reason you don't want that in your headline. A possible compromise would be something like "Eating bananas raises IQ: a study on college students", in case you want to stress the college-student participants a bit but do not want to create false assumptions. But usually it's not worth it.

Expand full comment

I think the important issue is whether the selection bias is plausibly highly correlated with the outcomes being measured. I think the reason people scream selection bias about internet polls is that participation is frequently selected for based on strong feelings about the issue under discussion.

So if you are looking for surprising correlations in a long poll (as you do with your yearly polls) that's less of an issue. But the standard internet survey tends to be one where the audience can either guess at the intended analysis and decide to participate based on their feelings about it, or where they are drawn to the blogger/tweeter because of similar ways of understanding the world, and so are quite likely to share whatever features of the author prompted them to generate the hypothesis.

Choosing undergrads based on a desire for cash is likely to reduce the extent of these problems (unless it's a study looking at something about how much people will do for money).

Expand full comment

Real scientists control for demographic effects when making generalizations outside the specifics of the dataset used. I'm confused why this article doesn't mention the practice - demographic adjustments are a well-understood phenomenon and Scott would have been exposed to them thousands of times in his career. And honestly, I think an argument can be made that the ubiquity of this practice in published science but its absence in amateur science mostly invalidates the thesis of this article, and I worry that Scott is putting on his metaphorical blinders due to his anger at being told off in his previous post for making this mistake.

This article does not feel like it was written in the spirit of objectivity and rationalism - it feels like an attempt at rationalization in order to avoid having to admit to something that would support Scott's outgroup.

Expand full comment
author

I have no idea what you're talking about. I have been reading psychology and psychiatry studies for years and have never seen them do this.

Here are some studies from recent issues of the American Journal of Psychiatry, one of the top journals in the field. Can you show me where they do this?

https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.2020.19080886

https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.20220456

https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.21111173

Expand full comment

pewresearch.org/our-methods/u-s-surveys/frequently-asked-questions/

From the article:

"...To ensure that samples drawn ultimately resemble the population they are meant to represent, we use weighting techniques in addition to random sampling. These weighting techniques adjust for differences between respondents’ demographics in the sample and what we know them to be at population level, based on information obtained through institutions such as the U.S. Census Bureau."

My apologies for assuming you were already familiar with this concept.

Expand full comment

Opinion polling by a think tank is not most people's central example of "real scientists". Is "real pollsters" maybe the category you have in mind? I have no personal info one way or the other on whether psych researchers at universities and other scientific research institutions do what Scott says, but I don't think Pew Research is a very useful counter-example.

Expand full comment

One gets the impression you didn't read the essay before commenting. Right in bold near the top it says this:

"Selection bias is disastrous if you’re trying to do something like a poll or census."

Scott is talking about a different kind of research, one that is *not* a poll or census, which is *not* attempting to say "x% of the people in [some large group] have [characteristic], based on measurement of [some small group]."

As I understand it, he is talking about the difference between measuring a distribution and measuring a correlation function. A distribution says "x% of the population has this characteristic." A correlation says "if a member of a population has characteristic Y, then the probability that he also has characteristic Z is x%." They are two very distinct kinds of measurement, and so far as I can tell, he is correct that to a first approximation you can at least test for the existence of correlations on any subset of the distribution without worrying a great deal about your sample well representing the overall distribution.
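A rough illustration of that distinction (my numbers, all invented): a heavily skewed sample can be terrible for estimating "what share of people have trait Z" while still getting the correlation between two other variables roughly right, provided selection isn't driven by both of them.

```python
# Distribution vs. correlation under selection: sample chosen purely on IQ.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

iq = rng.normal(100, 15, n)
y = rng.normal(size=n)
z = 0.4 * y + rng.normal(size=n)      # true corr(y, z) ~ 0.37

sample = iq > 120                     # think: blog readers, psych undergrads

print("pop share with IQ>115:   ", round(np.mean(iq > 115), 3))
print("sample share with IQ>115:", round(np.mean(iq[sample] > 115), 3))  # wildly off
print("pop corr(y, z):          ", round(np.corrcoef(y, z)[0, 1], 3))
print("sample corr(y, z):       ", round(np.corrcoef(y[sample], z[sample])[0, 1], 3))  # similar
```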

There are certainly weird edge cases where this would not be true, but that doesn't mean the general rule is unreasonable.

Expand full comment

This actually points (as does the whole article), at a much more interesting point: the belief in “proper official people who are doing everything right” vs “random cranks larping at [science/politics/law/history].” I suspect this is the intuition behind people saying “selection bias:” they want a reason why some things are proper official science and some things aren’t.

Anyone who works in any field is aware that they’re staying ahead of the cranks through cumulative experience and sharing their homework, but there are no grown-ups and no-one has access to special doing-things-properly methods.

Expand full comment

I read Scott's post as saying that the emperor, and lots of other people, are barely dressed, and that's OK. The replies read like imperialist counter-claims, but seem lacking in evidence, i.e. proper official studies that select from a spread of the population with lower bias than psych students or poor/bored folks.

Nothing against imperialists

Expand full comment

Yep. This thread on TheMotte is relevant: https://www.themotte.org/post/221/culture-war-roundup-for-the-week/41477?context=8#context

> Aella recently made an online survey about escorting and posted a chart on Twitter. It shows monthly earnings binned by BMI and clearly depicts that escorts with lower BMI making more on average than escorts with higher BMI. I would not have thought anybody would be surprised by that. The comments under the post proved me wrong.

> Christ almighty, I had no idea that there are so many statistically literate whores around just waiting to tell you your survey is bad. I also wasn't aware that escorts advertise their services so openly on social media.

> The number of escorts, both slim and not so slim, calling her out with little to no argument is mind blowing. The arguments they do give basically amount to sample size too low, BMI isn't real or "your survey is bad, and you should feel bad". Some of them also appear to lack reading comprehension. (...) Some give the argument that they themselves have high BMI but earn way more than that, and therefore the survey result must be wrong. Averages are seemingly a foreign concept to some.

> **A few are asking what Aella's credentials are or whether the survey has been reviewed by an ethics committee, as if you need any of that to do a random google forms survey on the internet. They appear to believe that ethics committees are to protect people who might find the result offensive and not the participants of the study.**

Expand full comment

This surely depends on the field and the questions asked, but I can assure you that a large portion of psych studies do not do that. And (most of the time) that is completely fine (as argued in the main post).

Expand full comment

(1) It's also worth noting that you can do a lot of sensitivity tests to see how far the results within your sample appear to be influenced by different subgroups which can help indicate where the unrepresentativeness of your sample might be a problem. IIRC the EA Survey does this a lot. This also helps with the question of whether an effect will generalise to other groups or whether, e.g. it only works in men.

Of course, this doesn't work for unobservables (ACX subscribers or Aella's Twitter readers are likely weird in ways that are not wholly captured by their observed characteristics, like their demographics).

(2) I think you are somewhat understating the potential power of "c) do a lot of statistical adjustments and pray", which in turn understates the gap between an unrepresentative internet sample that you can and do statistically weight and an unrepresentative internet sample (like a Twitter poll) that you don't weight. Weighting very unrepresentative convenience samples can be extremely powerful in approximating the true population, while unweighted Twitter polls are almost never going to be representative of the population.
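A minimal post-stratification sketch of what weighting buys you (the groups, shares, and outcomes are all invented): the sample massively over-represents one group, but reweighting by known population shares recovers something close to the population mean.

```python
# Post-stratification: weight each respondent by (population share of their
# group) / (sample share of their group).
import numpy as np

rng = np.random.default_rng(4)

# Population: 30% group A (mean outcome 1.0), 70% group B (mean outcome 0.0).
pop_shares = {"A": 0.3, "B": 0.7}
true_pop_mean = 0.3 * 1.0 + 0.7 * 0.0

# Convenience sample: 90% A, 10% B.
n = 10_000
groups = np.where(rng.random(n) < 0.9, "A", "B")
outcome = np.where(groups == "A", 1.0, 0.0) + rng.normal(0, 0.5, n)

raw_mean = outcome.mean()

sample_shares = {g: np.mean(groups == g) for g in ("A", "B")}
weights = np.array([pop_shares[g] / sample_shares[g] for g in groups])
weighted_mean = np.average(outcome, weights=weights)

print("true population mean:", round(true_pop_mean, 3))   # 0.30
print("raw sample mean:     ", round(raw_mean, 3))         # ~0.90
print("weighted sample mean:", round(weighted_mean, 3))    # ~0.30
```

The obvious caveat, as noted in (1): this only corrects for selection along the variables you actually weight on.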

Expand full comment

Seems like a good argument for rejecting studies done on Psych 101 undergrads, not for accepting surveys done on highly idiosyncratic groups of blog readers.

Expand full comment

I would agree with that.

I think that may be a bridge too far, especially since somebody trained in good methodology could look at a dataset with care to try to balance out and offset these risk factors. (Note: they may not do it, but the evidentiary value of biased data is not zero.)

Just..... this is really coming after the Elon Twitter surveys, and for people who are used to idiosyncratic group surveys in other contexts. Surveys on the Fox News website, or those performed by the RNC on their likely voters, also would and do have clear biases, even if the questions were worded in an unbiased manner.

Expand full comment

Yeah. My baseline prior for *all* psych and sociology studies is "more likely than not utter garbage" unless the effect size is *huge*. And even then it's "probably utter garbage". Those done on Psych 101 undergrads start at "almost absolutely utter garbage." And internet polls, *especially* of "social media followers", are in that same bucket. Too many uncontrolled variables that are very likely to correlate strongly with the effect under study.

And lest you think I'm particularly biased there, my baseline for *hard physics* studies is "50% chance of being utter garbage."

Basically, almost all science is utter garbage. But some is more often utter garbage. And both "surveys of internet followers" and "Psych 101 undergraduate studies" are in the "don't even bother looking further except for amusement" bucket for me.

And @Scott--even in correlations, bias matters strongly. Giving the banana study to Mensa members means that your effect, if any, is out there in the part of the curve that we can't measure very well at all. Measuring the difference in IQ past a standard deviation or two is basically just noise *anyway*, so trying to correlate that with banana consumption is just noise squared.

Expand full comment

Care to explain your baseline on hard physics studies? IME it's pretty rare for an experimental paper to make false positive claims, though plenty are weaker than they should be or testing hypotheses that were probably not worth wasting time on.

Expand full comment

It doesn't have to be outright false to be utter garbage. It just has to fail to say anything meaningful. It could be 100% true, 100% valid...and still be utter garbage such that the writer and the world would have been better off if it hadn't been done (ie was a waste of resources). And not just experimental work--I'm including all the theory. This is based on my own training--I have a PhD in computational quantum chemistry. Plus my usual jaundiced eye--I'm a firm believer in Sturgeon's Law (90% of everything is crap). So a 50% "crap rate" is actually doing much better than normal.

Expand full comment

Or, instead of employing the binary of reject/accept to whole categories of studies, one may wish to adopt a more nuanced, Bayesian, approach. Like, isn't this whole blog basically about weak evidence being evidence too?

Expand full comment

The underlying phil of sci question is to what extent you are justified in believing your sample is representative for the question you are testing. It's generally understood that "we test on people we can rope into our studies" is a problem that generates potential bias and can undermine representativeness for a general conclusion, but I think the article is far too flippant about the amount of effort that goes into this kind of question when psychologists are drawing inferences (or failing to do so), compared to amateur internet polls. It flattens the distinction between a known problem that exists to varying degrees, and is handled with varying degrees of respectable response, and simply throwing your hands up in the air.

Expand full comment

I wouldn't settle for such a simple heuristic, as any non-expert in those fields can do much better than this simply by asking whether the study question sounds plausible (as the studies on people trying to predict which studies replicate have shown). The studies done on Psych 101 students are probably not much worse than the other ones, as the main reason for bad studies is not the sample but the p-hacking etc.

If you want simple heuristic, it's more like "boring psych results" -> true, "surprising and interesting psych results" -> false.

Conflict of interest: I'm doing boring research.

Expand full comment
Dec 28, 2022·edited Dec 28, 2022

People's internal sense of plausibility is informed by their cultural beliefs about folk psychology, which in turn are influenced by pop psychology. This sometimes transforms what maybe should be thought of as surprising, interesting idea into something boring. False memory research was in vogue when I was a psych student. This had a lot of sexy results that called into question "repressed memories" whose reality for the public had become a rather conventional, boring belief.

Expand full comment

Since someone evaluating a claim can never know how many polls didn't show interesting results, the facts that real-world surveys are much more expensive to conduct and that they leave fewer variables outside the survey giver's control (accepted practice is not to tell the undergrads what they are coming in for, and cash is the primary motivator in all of them) are a very strong justification for treating online polls as less reliable.

In some sense the real selection bias is the selection bias in which polls you haven't heard about, but that's a good reason. Though it leads to an interesting epistemic situation where the survey giver may have no more reason to doubt their poll than academics polling undergrads have to doubt theirs, while the people they tell about it do.

Expand full comment

What you’re describing is not unique to amateur internet studies. The term File Drawer Effect refers to the exact phenomenon you describe, but in officially real science.

Expand full comment

Yes, I'm aware of that, but things like the cost of running in-person surveys, IRB approval, etc. mean the problem is orders of magnitude worse for online surveys.

As I suggest in another comment if each poll was accompanied by a certain sized charitable donation it might help make them comparable.

Expand full comment

> It doesn’t look like saying “This is an Internet survey, so it has selection bias, unlike real-life studies, which are fine.”

Eh, this seems like a highly uncharitable gloss of the concern. I would summarize it more as "Selection (and other) biases are a wicked hard problem even for 'real-life' studies that try very hard to control for them; therefore, one might justly be highly suspicious of internet studies for which there were no such controls."

One good summary of the problem of bias in 'real-life' studies: https://peterattiamd.com/ns003/

The issue is always generalization. How much are you going to try to generalize beyond the sample itself? If not at all, then there is no problem. But, c'mon, the whole point of such surveys is that people do want to generalize from them.

Expand full comment
author

They're not a hard problem for real-life studies! Most people just do their psychology experiments on undergraduates, and most of the time it's fine! Most drug trials are done in a convenience sample of "whoever signs up for drug trials", and although there are some reasons you sometimes want to do better, it's good enough for a first approximation.

Expand full comment

Why do you say it's fine? It's publishable, sure. But it's not like this even led to a body of literature that's reproducible in the SAME unrepresentative population of freshman undergrads, let alone generalizes to tell us true facts about the world.

Expand full comment
author

Yes, I agree it had unrelated problems, which were not selection bias.

Expand full comment

What? Selection bias is definitely one of the issues that caused (/is still causing) the replication crisis. I definitely disagree that "most of the time it's fine", and I'm pretty surprised to see you, in particular, making that claim.

Expand full comment

“The replication crisis” normally describes ideas which aren’t true for *any* population. For example, psychology studies which tend to support the researcher’s favourite intervention don’t behave that way because different researchers use different sets of undergraduates (maybe there’s some “how-popular-is-this-method” effect, but I’d suspect the researcher effect would still apply between studies in the same university and year).

Something which is true of undergraduates but not of the general population would be interesting, and might get counted as part of the replication crisis, but it’s not a central example of the replication crisis.

Expand full comment
Dec 28, 2022·edited Dec 28, 2022

> What? Selection bias is definitely one of the issues that caused (/is still causing) the replication crisis.

I'm with Gres; the replication crisis was caused by having standards of publication that didn't even refer to whether the finding was true or false. Selection bias isn't an issue if your entire paper is hallucinated. There was nothing to select.

Expand full comment

The problem with this argument is that you have no evidence either way on the selection bias issue. We had a bunch of psychologists do a bunch of research that was hopelessly contaminated with selection bias. The selection bias didn't matter, because there were so many other problems with this body of research that it had no value at all, and therefore there was nothing for selection bias to ruin. If selection bias takes the value of a body of research from zero to zero, it hasn't hurt anything.

But you appear to be claiming that, if those other problems hadn't existed, the selection bias still wouldn't have been a problem. This is not obvious; maybe it would have been a big problem.

Expand full comment

There’s also a big body of psychology research which does replicate on both undergrads and the general population, and a much smaller body of research which replicates for undergrads but not for anyone else. Thus, selection bias is rarely a problem among good studies on undergrads.

Expand full comment

The problem has led to an entire literature of observational studies in nutritional epidemiology that is essentially worthless –– impossible to discern signal from noise in many cases.

Many drug trials are deeply compromised, not at all good enough for a first approximation. The book "Ending Medical Reversal" by Prasad and Cifu goes deep on this.

https://www.amazon.com/Ending-Medical-Reversal-Improving-Outcomes/dp/1421429047/

Expand full comment

I can't agree with your summary because what I'm reading in this article is an argument that the 'real-life' studies *don't* try very hard to control for selection bias (even if they acknowledge that it is a wicked hard problem) and so you should justly treat internet studies about as highly or lowly as other studies, because they have about the same level (lack of) controls for selection bias.

Expand full comment

The fact that academic studies are often terrible does not imply that internet studies are therefore OK. That's just an emotional reaction ("How come you pick on me when you don't pick on them?") It's totally possible for them both to be terrible –– and, as a matter of fact, I do also pick on them.

Expand full comment

So, this is kinda accurate, but I feel like you're underestimating the problems of selection bias in general. In particular, selection bias is a much bigger deal than I think you're realizing. The correlation coefficient between responding to polls and vote choice in 2016 was roughly 0.005 (Meng 2018, "Statistical Paradises and Paradoxes in Big Data"). That was enough to flip the outcome of the election. So for polls, even an R^2 of *.0025%* is enough to be disastrous. So yes, correlations are more resistant to selection bias, but that's not a very high bar.

Correlations are less sensitive, but selection effects can still matter a lot. As an example, consider that among students at any particular college, SAT reading and math scores will be strongly negatively correlated, despite being strongly positively correlated in the population as a whole: if a student had a higher score on both reading and math, they'd be going to a better college, after all, so we're effectively holding total SAT constant at any particular school.
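A quick check of that SAT claim with a stylized simulation (not real admissions data, just two correlated scores and a narrow total-score band standing in for "one college"):

```python
# Reading and math are positively correlated overall, but conditioning on a
# roughly constant total score (i.e. "students at this particular college")
# flips the sign.
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

ability = rng.normal(size=n)
reading = ability + rng.normal(size=n)
math = ability + rng.normal(size=n)          # corr(reading, math) ~ 0.5 overall

total = reading + math
one_college = (total > 1.0) & (total < 2.0)  # admits in a narrow total-score band

print("population corr: ", round(np.corrcoef(reading, math)[0, 1], 3))  # ~0.5
print("within 'college':", round(np.corrcoef(reading[one_college],
                                              math[one_college])[0, 1], 3))  # negative
```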

So the question is, are people who follow Aella or read SSC as weird a population as a particular college's student body? I'd say yes. Of course though, it depends on the topic. For your mysticism result, I'm not worried, because IIRC you observe the same correlations in the GSS and NHIS--which get 60% response rates when sampling a random subset of the population. But I definitely wouldn't trust the magnitude, and I'd have made an attempt at poststratifying on at least a couple variables. Just weighting to the GSS+Census by race, income, religion, and education would probably catch the biggest problems.

Expand full comment
author

I think you're specifically selecting categories where this weird thing happens, and then saying we should expect it in other categories.

(also, polls are a bad example - because of Median Voter Theorem we should expect them to be right on the verge of 50-50, and so even small deviations are disastrous)

Expand full comment

I was selecting those as examples, but these problems are actually very common. Surveys in general are very sensitive to them unless you're at least a bit careful to get good samples.

Maybe a good way to explain this is that seeing a correlation in a survey provides about as much evidence of correlation in the population, as seeing a correlation in the population provides of causality. This isn’t just a metaphor: there’s a very real sense in which selection biases are just backwards confounding. They’re often called “Inverted forks” in the causal inference literature—a “fork” being the classical confounder where you have one variable affecting two unrelated things. (If you imagine a diagram with lines going from cause to effect, a confounder has lines going to the two correlated variables, which looks like a two-tined fork if you’re sufficiently high and/or hungry.) A selection effect is the exact same, except flipping which variables are observed—you have two effects going into the same variable (e.g. mental illness and spirituality might both affect the probability of answering the SSC survey, in which case considering only responders creates a bias). Having to think backwards is a lot harder, so we intuitively imagine these problems must be rare, but they’re just as common as their flipped counterparts.

To be clear I’m not saying the survey data is useless; I’m guessing it’s right! But I definitely feel like this post isn’t urging sufficient caution. Pollsters put millions of dollars into trying to get representative samples, or reweighting responses to make the sample representative, and they *still* get correlations very wrong sometimes (e.g. underestimating the correlation of being black with the probability of voting for Walker in GA by a factor of 2).

I’d like to see commenters offering more reasons to expect the results could be wrong, but the presumption that the data here aren’t confounded is pretty weak, and we should be pretty uncertain about it—at least until we’ve tried to make the results somewhat more representative with weighting or poststratification.

Expand full comment

I just read the Meng (2018) paper because of your mention above.

It says that when using a non-random sample, a correlation between opting into the sample (by, for example, responding to a poll) and the variable being tested causes huge problems. The paper is specifically addressing this problem in the context of very large datasets or Big Data which are non-random and showing how the sample size in those cases doesn't improve predictive power.

The example presented to illustrate this is 2016 pre-election polling. In that case, the correlation between opting into the sample and actually voting for Trump (based on post-election results) was tiny but negative, namely people who were going to vote for Trump were slightly less likely to respond to the poll, and this caused the result of the poll to significantly underweight the Trump vote. And, of course, this didn't flip the outcome of the election, it flipped the outcome of the poll. The results of the election were based on the result variable, not the choice to respond.

Basically, I think this paper doesn't tell us much about relatively small samples from relatively small populations, like Scott's annual survey. Its issue is that the above correlation scales with the square of the population, so Big Data isn't all that great if it's not random.

Expand full comment

If this is about the last article your general point is correct but you polled a readership that's notoriously hostile to spirituality to determine if mental health correlates to spirituality. It'd be like giving Mensa folks a banana and measuring their IQ. You selected specifically for one of the variables and that's likely to introduce confounders.

Expand full comment
author

I'm worried we're still disagreeing on the main point, based on your example. To a first approximation, testing the correlation between banana-eating and IQ in a Mensa sample should still be fine. Everyone will have high IQ, but if the super-high-IQ people eat more bananas than the just-regularly-high IQ people, this could still support the hypothesis.

(the main reason you wouldn't want to do this is ceiling effects, I think).

Likewise, in a very-low-spirituality sample, you should still be able to prove things about spirituality. For example, I bet ACX readers will have more spiritual experiences on LSD than off of it, just like everyone else.

Also, less important, but I wouldn't describe this sample as "notoriously hostile to spirituality" - 40% said they had a spiritual experience or something like it, some of my most popular posts are ones on meditation and jhanas and stuff.

Expand full comment

I agree it's still evidence but it's evidence that should be treated with extreme skepticism and then more skepticism in case you were insufficiently skeptical the first time.

It's not certain that high IQ folks will have different reactions to IQ increasing techniques than the general population but it's more likely than not.

Expand full comment
Dec 28, 2022·edited Dec 28, 2022

This is true to first order, and probably true in most cases, but I suspect not all.

Suppose bananas increase variance of IQ without affecting the mean, and Mensa contains a random sample of people with IQ >140. Then Mensa banana-eaters would have higher IQs on average, even though banana-eater IQ in the general population would have average IQ of 100. I think this is the main reason people argue for smaller schools - the top 100 schools have more small schools than expected, because small schools have higher variance in scores.
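That variance story is easy to check with a toy simulation (all parameters invented): bananas leave the mean unchanged but widen the spread, and above a Mensa-style cutoff the banana eaters nonetheless come out ahead.

```python
# Bananas increase IQ variance but not the mean; truncating at IQ > 140 makes
# banana eaters look smarter anyway.
import numpy as np

rng = np.random.default_rng(6)
n = 2_000_000

eats_bananas = rng.random(n) < 0.5
iq = np.where(eats_bananas, rng.normal(100, 20, n), rng.normal(100, 15, n))

mensa = iq > 140

print("pop mean, eaters vs not:  ",
      round(iq[eats_bananas].mean(), 1), round(iq[~eats_bananas].mean(), 1))   # ~100 vs ~100
print("Mensa mean, eaters vs not:",
      round(iq[mensa & eats_bananas].mean(), 1), round(iq[mensa & ~eats_bananas].mean(), 1))
print("Mensa share who eat bananas:", round(eats_bananas[mensa].mean(), 2))    # well above 0.5
```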

To give a more relevant example, consider (my mental model of) crypto. I imagine crypto use has a U-shaped correlation with techiness, where normal people distrust it, moderately techy people like it, and very techy people distrust it again (this is probably wrong, but pretend it’s true for my example). Then in the general population, IT people would use more crypto than normal, but on ACX, IT people would use less crypto than the general population.

Probably 99% of surveys aren’t like this, but probably only 10% of surveys look like they might be like this, to a given observer. For that observer, those surveys should only be able to update their belief by p=0.1, unless they can be convinced that there’s actually less than a one-in-ten chance this survey has a nonlinearity like the second example.

Expand full comment
founding

You’re right each and every survey might have sampling bias, but whether that’s a problem ultimately all depends on what constitutes a good model of reality. Would you like to see some code to prove this point?

In words: suppose banana-eating increase IQ but only if you *don’t* eat enough fish. Then, sampling high IQ means sampling high status, then more diversified food, then less impact of banana-eating, and you might miss it even if statistical power is good.

Or, suppose banana-eating increases IQ, but only if you *do* eat enough fish. Then, sampling high IQ means sampling high status, then more diversified food, then more impact of banana-eating, then failure to replicate in general population.
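Since code was offered above, here is one possible version of the first scenario (every number invented, and "fish" and "status" collapsed into a single IQ proxy for brevity): the banana effect exists only for people who don't eat enough fish, fish-eating tracks IQ, and so the effect is nearly invisible in a high-IQ sample while being real in the full population.

```python
# Interaction effect hidden by sampling on IQ.
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

base_iq = rng.normal(100, 15, n)
eats_fish = rng.random(n) < 1 / (1 + np.exp(-(base_iq - 100) / 5))  # high IQ -> mostly eats fish
eats_bananas = rng.random(n) < 0.5                                   # independent of everything

banana_boost = np.where(eats_bananas & ~eats_fish, 5.0, 0.0)         # effect only without fish
iq = base_iq + banana_boost

def banana_gap(mask):
    """Average IQ difference, banana eaters minus non-eaters, within a subgroup."""
    return iq[mask & eats_bananas].mean() - iq[mask & ~eats_bananas].mean()

everyone = np.ones(n, dtype=bool)
high_iq_sample = base_iq > 130

print("banana 'effect', general population:", round(banana_gap(everyone), 2))        # ~2-3 points
print("banana 'effect', high-IQ sample:    ", round(banana_gap(high_iq_sample), 2))  # ~0
```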

Expand full comment

This example is exactly the point I was trying to make. Except I don't think you need the causal parts at all. It doesn't matter that IQ is a proxy for status is a proxy for more diversified diet. If banana-eating increases IQ only if you *don't* eat enough fish, almost certainly Mensa folks who don't already eat a ton of bananas are already eating enough fish. And you know that because they have massively high IQs. By selecting "Mensa members who don't eat many bananas" you're virtually guaranteeing that you're selecting for every single confounder to your banana hypothesis, whether you're able to identify those confounders or not.

If you want to see whether spiritual experiences correlate with mental health, and your sample is "non-religious folks who are mentally healthy" then you've selected exactly the group that would confound the hypothesis, even if it's true.

To be fair, that's not *exactly* what happened with the survey, but it's close.

Expand full comment

> For example, I bet ACX readers will have more spiritual experiences on LSD than off of it, just like everyone else.

ACX readers - maybe. On LW, maybe not? LSD doesn't seem to just generate random beliefs out of nowhere. Unless stuff like ego death counts as spirituality.

Expand full comment

Going to Aella's tweet that was linked:

> using it as a way to feel superior to studies, than judiciously using it as criticism when it's needed

just because people use selection bias as a way to feel superior to studies doesn't mean that the study isn't biased in the first place

and

> But real studies by professional scientists don’t have selection bias, because...

ignoring the fact that professional studies control for selection bias, or at least have a section in the paper where the participants are specified, unlike twitter polls

Expand full comment
author

As I've said many times in this post, I challenge you to find these professional psychology and psychiatry studies that "control for selection bias". I think doing this would actually be extremely irresponsible without a causal model of exactly how selection into your study works. If you look at actual psych studies (eg https://asset-pdf.scinapse.io/prod/2001019597/2001019597.pdf and https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.20220456 , randomly chosen just so we have concrete examples), they don't do anything of the sort.

I agree they sometimes mention participant characteristics (although I think that psych study I linked doesn't even go so far as to mention gender, let alone class), but so does the SSC survey! I agree Twitter polls are extremely vulnerable to selection bias (especially since they're polls), but my impression is that Aella also does more careful surveys.

Expand full comment

If someone's doing an RCT they're "controlling for selection bias." Their inferences are comparing a treatment group to a fundamentally similar control group.

What they're not doing is demonstrating or accounting for external validity. You're right that the best you can ordinarily expect is some description of the sample, plus perhaps a heterogeneity analysis or a reweighting of the sample to look like some population of interest.

But these are different problems, and the "selection bias" problem is a lot more fundamental than the "external validity" problem. If you have an unbiased study of a weird population, you're still measuring a real effect, and can think about how likely it is to generalize by thinking about the likely mechanisms of effect. If you have a study that's biased by the weirdness of your population, the correlation you measure might just be measuring how the factors you study affect people's likelihood of reading your blog, without any real relationship or real mechanism.

Expand full comment

Selection bias can and absolutely does break correlations, frequently. The most obvious way is through colliders (http://www.the100.ci/2017/03/14/that-one-weird-third-variable-problem-nobody-ever-mentions-conditioning-on-a-collider/) - but there's tons of other ways in which this can happen: the mathematical conditions that have to hold for a correlation to generalize to a larger population when you are observing it in a very biased subset are pretty strict.

Further: large sample sizes do help, but they do not help very much. There is a very good paper that only requires fairly basic math that tackles the problem of bias in surveys: https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf (note: this is not specifically about correlations, but the problem is closely related). Here is the key finding:

Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a ρ_{R,X} ≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence)

And keep in mind - this is in polling, which 'tries' to obtain a somewhat representative sample (ie, this sample is significantly less biased than a random internet sample).
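For anyone who wants to check the quoted numbers, here's the back-of-envelope version of Meng's identity (my arithmetic, not code from the paper; the eligible-voter count is approximate):

```python
# Effective simple-random-sample size implied by a data defect correlation rho
# on a sample of n out of N, per Meng's identity.
N = 231_000_000      # rough count of US eligible voters in 2016 (approximate)
n = 2_300_000        # size of the non-random sample
rho = 0.005          # magnitude of the data defect correlation

f = n / N
# MSE of the biased sample mean ~ rho^2 * (1 - f) / f * sigma^2
# MSE of an SRS of size m       ~ (1/m - 1/N) * sigma^2
# Setting them equal and solving for m:
n_eff = 1 / (rho**2 * (1 - f) / f + 1 / N)

print(f"sampling fraction f = {f:.4f}")
print(f"effective SRS size ~ {n_eff:.0f}")   # roughly 400, matching the paper's claim
```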

Expand full comment

Looking at Aella's data & use of it, I don't have the same concerns I may have about the SSC survey used on religious issues.

So this chart, for example:

https://twitter.com/Aella_Girl/status/1607641197870186497

I am not aware of a likely rationale for these results to change because of the selection effect, specifically on the axis studied. I might raise selection-effect concerns if the exact slope were the specific question, but not for the mere presence of a slope.

Even further, I don't have the sense here that Aella is trying to turn a very messy & vague original problem statement into something to refute without providing a number of caveats.

It is valid to push back that selection effects are everywhere. It is valid to argue that SSC data has some evidentiary value, and that as good Bayesians we should use it as evidence. But the tone of the post doesn't hit the right note to keep that argument from being rejected.

However, to push-back the push-back, I would seriously try to assess if you have a challenge in dealing with disagreements or challenges. Not trying to psychologize this too much, however, is this post actually trying to raise the discourse? Or is this post just trying to nullify criticism? Are you steelmanning the concern, or are you merely rebutting it?

Expand full comment
Dec 27, 2022·edited Dec 27, 2022

if you're talking about mental health issues and mystic visions, then I'm (1) religious (2) have had *very* few 'spiritual experiences' (about one, maybe two tops that I remember) (3) have *never* had the big flashy ones and (4) do think that the first question that should be asked about people reporting big flashy experiences is "are you nuts in the noggin?"

So, absolutely the audience on here is selected for people who aren't religious and wouldn't quantify experiences as "spiritual experiences". *But* it is also selected for people who have used all kinds of drugs, brain-hacking, and nootropics, so there's a very good chance they have had the 'mystic trip I was talking to entities' experiences. That they put those down to "yeah well drugs fuck your brain up so you have those kinds of visions" rather than "it absolutely was the spirit of the drug enlightening me to the cosmic secrets" makes me trust their reports more rather than less.

Even I don't think that everyone who claims they have regular weekly chats with Jesus, the Blessed Virgin, or God Almighty about the state of the world are having what they say they are having; they may be sincere but deluded (nuts in the noggin) or they may be fakes and fraudsters scamming people. People who are mostly sane and are having genuine experiences are not that common.

Expand full comment