Attempts To Put Statistics In Context, Put Into Context

...

Jun 08, 2023

Sometimes people do a study and find that a particular correlation is r = 0.2, or a particular effect size is d = 1.1. Then an article tries to “put this in context”. “The study found r = 0.2, which for context is about the same as the degree to which the number of spots on a dog affects its friskiness.”

But there are many statistics that are much higher than you would intuitively think, and many other statistics that are much lower than you would intuitively think. A dishonest person can use one of these for “context”, and then you will incorrectly think the effect is very high or very low.

In last week’s post on antidepressants, I wrote:

“Consider a claim that the difference between treatment and control groups was ‘only as big, in terms of effect size, as the average height difference between men and women - just a couple of inches’ (I think I saw someone say this once, but I’ve lost the reference thoroughly enough that I’m presenting it as a hypothetical). That drug would be more than four times stronger than Ambien!”

But we can do worse. Studies find that IQ correlates with grades at about 0.54. Here are two ways to put that in context:

IQ determines less than 30% of the variance in grades; the other 70% is determined by other things.
IQ affects grades more than political affiliation (liberal vs. conservative) affects whether or not you like Donald Trump (on the ACX survey).

The first way makes it sound like IQ doesn’t matter that much; the second way makes it sound like it matters a lot.
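
The “less than 30%” figure comes from squaring the correlation, which gives the share of variance one variable linearly accounts for in the other. Here is a minimal sketch of that arithmetic; the function name is my own illustration, not something from a study or a library.

```python
def variance_explained(r: float) -> float:
    """Share of variance linearly accounted for, given a correlation r."""
    return r ** 2


# 0.54 is the IQ-vs-grades correlation quoted in the post.
r_iq_grades = 0.54
share = variance_explained(r_iq_grades)

print(f"r = {r_iq_grades} -> r^2 = {share:.2f}, leaving {1 - share:.2f} to other things")
```

Run on the 0.5 correlation between conservatism and Trump support, the same function gives 0.25. The two framings describe nearly the same amount of variance, and differ mostly in tone.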

Or suppose that you’re debating whether there’s such a thing as “general” intelligence, e.g. whether students who are good at reading are also good at math. People have studied this; here are two ways to describe the result:

The correlation between reading scores and math scores is lower than the correlation between which college majors have higher IQ, and which college majors have skewed gender balance.
The correlation between reading and math is higher than the correlation between which countries are near the equator, and which countries are hot.

The first effect might sound kind of trivial, but it is r = 0.86. And the second effect might sound immense, but it is only r = 0.64. The real correlation between standardized reading and math tests is in between, r = 0.72. The examples above might be sort of cheating, because they’re comparing college majors (which are averaged-out aggregates of people) to countries (which are just individual countries). But that’s my point. It’s easy to cheat!

Obviously someone wanting to exaggerate or downplay the generality of intelligence could choose which of these two ways they wanted to “put it into context”. I don’t have a solution to this except for constant vigilance and lots of examples.

So here are a lot of examples. I thought I was the first to do this, but partway through I found some prior art. None completely satisfied me, but I’ve stolen a little from all of them. Credit here: Meyer et al, Hattie on education, Reason Without Restraint, Leucht et al.

Some effect sizes and correlations are naturally misleading, or depend a lot on context. I’ve tried as hard as I can to avoid these and make all my examples clear, but they will necessarily require some charity.

Effect Size:

DARE keeps kids off drugs: 0.02
Single-sex schools improve grades: 0.08
Smaller class sizes improve grades: 0.21
SSRIs help depression: 0.4
Ibuprofen helps arthritis pain: 0.42
Women are more empathetic than men: 0.9
Oxycodone helps pain: 1.0
Smokers get more lung cancer than non-smokers: 1.1
Men commit more violent crime than women: 1.1
Men are more into engineering than women: 1.1
Adderall helps ADHD: 1.3
Men are taller than women: 1.7
Children tutored individually learn more than in a classroom: 2.0
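
One hedged way to translate these numbers into something more intuitive is the “probability of superiority”: the chance that a randomly chosen member of the higher group beats a randomly chosen member of the lower group. The sketch below assumes both groups are normally distributed with equal variance, which real data only approximates, and the function name is my own.

```python
from math import sqrt
from statistics import NormalDist


def probability_of_superiority(d: float) -> float:
    """Chance a random draw from the higher group exceeds one from the lower.

    Assumes two normal distributions with equal variance whose means
    differ by d standard deviations.
    """
    return NormalDist().cdf(d / sqrt(2))


# Effect sizes taken from the list above.
examples = {
    "DARE keeps kids off drugs": 0.02,
    "SSRIs help depression": 0.4,
    "Men are taller than women": 1.7,
}

for label, d in examples.items():
    print(f"{label}: d = {d} -> {probability_of_superiority(d):.0%}")
```

Under those assumptions, d = 0.4 works out to about a 61% chance and d = 1.7 to about 89%, against a coin-flip baseline of 50% for d = 0. This is one more way of putting things in context, with all the caveats that implies.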

Correlation:

Extraversion vs. holiday gift spending: 0.09
Extraversion vs. having more sex: 0.17
Political conservatism vs. happiness: 0.18
Brain size vs. intelligence: 0.19
Parent’s social class vs. child’s grades: 0.22
Political liberalism vs. concern about COVID: 0.33
Boss vs. coworker assessments of job performance: 0.34
Husband’s attractiveness vs. wife’s attractiveness: 0.4
High school GPA vs. SAT score: 0.43
Height vs. weight: 0.44
IQ vs. educational attainment: 0.44
Political conservatism vs. support for Trump (on ACX survey): 0.5
IQ vs. grades: 0.54
Latitude vs. temperature: 0.6
Depression vs. anxiety: 0.64
SAT verbal score vs. SAT math score: 0.72
Two different methods of testing arterial oxygen: 0.84
The same student’s score taking the SAT twice: 0.87

A statistician who read a draft of this post suggested throwing out these general statistical effect measures in favor of specific ones; not only does nobody know what they mean, but nobody cares what the “standardized effect size” of a treatment for depression is - they care how much less depressed they’re going to get.

I sort of agree, but I can see some use for these. We can’t express how much less depressed someone gets in comprehensible units - it will either be stuff like this or “5 points on the HAM-D”, which is even more arcane and harder for laymen to interpret. And when you get to questions like “how big is the gender difference in empathy?”, I can’t think of another equally clear way to express it.

Still, it’s hard to understand and easy to mislead with, so watch out.

Astral Codex Ten
