Shouldn't "19:1 in factor" be actually "19:1 in favor"?

Bayesian statistics requires you to compare the chance of the outcome given the null hypothesis vs. the chance of the outcome given H1.

So the chance of getting p=0.5 given the null hypothesis is very high, but very low given H1, so it should significantly update you towards H0.
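
To make the comparison concrete, here is a minimal sketch, assuming the result is summarized as a z-score that is normally distributed with unit variance, centered on 0 under H0 and on some effect under H1. The H1 effect of 3.0 is a made-up number for illustration, not something from the study.

```python
from statistics import NormalDist

def bayes_factor_h0_vs_h1(z_obs: float, h1_mean_z: float) -> float:
    """Likelihood of the observed z-score under H0 divided by that under H1."""
    like_h0 = NormalDist(mu=0.0, sigma=1.0).pdf(z_obs)
    like_h1 = NormalDist(mu=h1_mean_z, sigma=1.0).pdf(z_obs)
    return like_h0 / like_h1

# A two-sided p-value of 0.5 corresponds to |z| of roughly 0.67.
z_obs = NormalDist().inv_cdf(1 - 0.5 / 2)
bf = bayes_factor_h0_vs_h1(z_obs, h1_mean_z=3.0)  # 3.0 is an assumed H1 effect
print(f"z = {z_obs:.2f}, Bayes factor in favor of H0 = {bf:.1f}")
```

Under these assumptions the factor comes out well above 1, i.e. the p=0.5 result is evidence for H0 rather than a neutral 1:1. How far above 1 depends entirely on the effect size H1 is taken to predict.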

The key is to have a large enough study such that if there are 20 potentially relevant factors, even though one of them will probably show a significant difference between groups, that difference will be too small to explain any difference in results. Here the study was tiny, so one group had 25% high blood pressure and the other over 50%.

On the topic of your point in part I, this seems like a place where stratified sampling would help. Essentially, you run tests on all of your patients, and then perform your random sampling in such a way that the distribution of test results for each of the two subgroups is the same. This becomes somewhat more difficult to do the more different types of tests you run, but it shouldn't be an insurmountable barrier.

It's worth mentioning that I've seen stratified sampling being used in some ML applications for exactly this reason - you want to make sure your training and test sets share the same characteristics.

Were those really the only questions you considered for the ambidexterity thing? I had assumed, based on the publication date, that it was cherry-picked to hell and back.

One comment here: if you have a hundred different (uncorrelated) versions of the hypothesis, it would be *hella weird* if they all came back around p=.05. Just by random chance, if you'd expect any individual one of them to come back at p=0.05, then if you run 100 of them you'd expect one of them to be unusually lucky and come back at p=.0005 (and another one to be unusually unlucky and end up at p=0.5 or something). Actually getting 100 independent results at p=.05 is too unlikely to be plausible.

Of course, IRL you don't expect these to be independent - you expect both the underlying thing you're trying to predict and the error sources to be correlated across them. This is where it gets messy - you sort of have to guess how correlated these are across your different metrics (e.g. high blood pressure would probably be highly correlated with cholesterol or something, but only weakly correlated with owning golden retrievers). And that in itself is kind of a judgement call, which introduces another error source.

What I generally use and see used for multiple testing correction is the Benjamini-Hochberg method - my understanding is that essentially it looks at all the p-values and for each threshold X it compares to how many p-values<X you got vs how many you'd expect by chance, and adjusts them up depending on how "by chance" they look based on that. In particular, in your "99 0.04 results and one 0.06" example it only adjusts the 0.04s to 0.040404.
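
A minimal sketch of the Benjamini-Hochberg adjustment, written from the standard textbook description rather than from any particular library: sort the p-values, scale each by (number of tests / rank), and take a running minimum from the largest rank down so the adjusted values stay monotone.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# The "99 results at 0.04 and one at 0.06" example.
p_values = [0.04] * 99 + [0.06]
adjusted = benjamini_hochberg(p_values)
print(round(adjusted[0], 6), round(adjusted[-1], 6))
```

On that example each 0.04 is scaled by 100/99 and the 0.06 is left alone, which is where the 0.040404 figure comes from.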

But generally speaking you're right, all those methods are designed for independent tests, and if you're testing the same thing on the same data, you have no good way of estimating how dependent one result is on another so you're sort of screwed unless you're all right with over-correcting massively. (Really what you're supposed to do is pick the "best" method by applying however many methods you want to one dataset and picking your favorite result, and then replicate that using just that method on a new dataset. Or split your data in half to start with. But with N<100 it's hard to do that kind of thing.)

For a Bayesian analysis, it's hard to treat "has an effect" as a hypothesis and you probably don't want to. You need to treat each possible effect size as a hypothesis, and have a probability density function rather than a probability for both prior and posterior.

You *could* go the other way if you want to mess around with Dirac delta functions, but you don't want that. Besides, the probability that any organic chemical will have literally zero effect on any biological process is essentially zero to begin with.

Not directly answering the question at hand, but there's a good literature that examines how to design/build experiments. One paper by Banerjee et al. (yes, the recent Nobel winner) finds that you can rerandomize multiple times for balance considerations without much harm to the performance of your results: https://www.aeaweb.org/articles?id=10.1257/aer.20171634

So this experiment likely _could_ have been designed in a way that doesn't run into these balance problems.

On the ambidextrous analysis, I think what you want for a Bayesian approach is to say, before you look at the results of each of the 4 variables, how much you think a success or failure in one of them updates your prior on the others. e.g. I could imagine a world where you thought you were testing almost exactly the same thing and the outlier was a big problem (and the 3 that showed a good result really only counted as 1.epsilon good results), and I could also imagine a world in which you thought they were actually pretty different and there was therefore more meaning to getting three good results and less bad meaning to one non-result. Of course the way I've defined it, there's a clear incentive to say they are very different (it can only improve the likelihood of getting a significant outcome across all four), so you'd have to do something about that. But I think the key point is you should ideally have said something about how much meaning to read across the variables before you looked at all of them.

"I don't think there's a formal statistical answer for this."

Matched pairs? Ordinarily only gender and age, but in principle you can do matched pairs on arbitrarily many characteristics. You will at some point have a hard time making matches, but if you can match age and gender in, say, 20 significant buckets, and then you have, say, five binary health characteristics you think might be significant, you would have about 640 groups (20 × 2^5). You'd probably need a study of thousands to feel sure you could at least approximately pair everyone off.

I'm not convinced there is actually an issue. Whenever we get a positive result in any scientific experiment there is always *some* chance that the result we get will be random chance rather than because of a real effect. All of this debate seems to be about analyzing a piece of that randomness and declaring it to be a unique problem.

If we do our randomization properly, on average, some number of experiments will produce false results, but we knew this already. It is not a new problem. That is why we need to be careful to never put too much weight in a single study. The possibility of these sorts of discrepancies is a piece of that issue, not a new issue. The epistemological safeguards we already have in place handle it without any extra procedure to specifically try to counter it.

You basically never want to be trying to base your analysis on combined P factors directly. You want to--as you said--combine together the underlying data sets and create a P factor on that. Or, alternately, treat each as a meta-data-point with some stdev of uncertainty and then find the (lower) stdev of their combined evidence. (Assuming they're all fully non-independent in what they're trying to test.)

Good thoughts on multiple comparisons. I made the same point (and a few more points) in a 2019 Twitter thread: https://twitter.com/stuartbuck1/status/1176635971514839041

A good way to handle the problem of failed randomizations is with re-weighting based on propensity scores (or other similar methods, but PS is the most common). In brief, you use your confounders to predict the probability of having received the treatment, and re-weight the sample depending on the predicted probabilities of treatment. The end result of a properly re-balanced sample is that, whatever the confounding effect of blood pressure on COVID-19 outcomes, it confounds both treated and untreated groups with equal strength (in the same direction). Usually you see this method talked about in terms of large observational data sets, but it's equally applicable to anything (with the appropriate statistical and inferential caveats). Perfectly balanced data sets, like from a randomized complete block design, have constant propensity scores by construction, which is just another way of saying they're perfectly balanced across all measured confounders.
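
A minimal sketch of the re-weighting idea, assuming a single binary confounder so that the propensity score can be estimated as a simple within-stratum frequency (a real analysis would usually fit a model such as logistic regression over several confounders). The patient counts are hypothetical, and the code assumes every stratum contains both treated and untreated patients.

```python
from collections import defaultdict

def ipw_weights(rows):
    """rows: list of (treated, high_bp) booleans. Returns one weight per row."""
    counts = defaultdict(lambda: [0, 0])  # high_bp -> [n_control, n_treated]
    for treated, high_bp in rows:
        counts[high_bp][int(treated)] += 1
    weights = []
    for treated, high_bp in rows:
        n_control, n_treated = counts[high_bp]
        propensity = n_treated / (n_control + n_treated)
        # Inverse probability of the arm the patient actually ended up in.
        weights.append(1 / propensity if treated else 1 / (1 - propensity))
    return weights

def weighted_share_high_bp(rows, weights, arm):
    """Weighted fraction of high-blood-pressure patients within one arm."""
    num = sum(w for (t, bp), w in zip(rows, weights) if t == arm and bp)
    den = sum(w for (t, bp), w in zip(rows, weights) if t == arm)
    return num / den

# Hypothetical imbalanced trial: 25% high BP among treated, 50% among controls.
rows = ([(True, True)] * 10 + [(True, False)] * 30
        + [(False, True)] * 20 + [(False, False)] * 20)
weights = ipw_weights(rows)
share_treated = weighted_share_high_bp(rows, weights, True)
share_control = weighted_share_high_bp(rows, weights, False)
print(share_treated, share_control)
```

After weighting, both arms show the same share of high blood pressure, which is the sense in which the confounder now pushes on both groups equally. Outcomes would then be compared using the same weights.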

For the p-value problem, whoever comes in and talks about doing it Bayesian I believe is correct. I like to think of significance testing as a sensitivity/specificity/positive predictive value problem. A patient comes in from a population with a certain prevalence of a disease (aka a prior), you apply a test with certain error statistics (sens/spec), and use Bayes' rule to compute the positive predictive value (assuming it comes back positive, NPV otherwise). If you were to do another test, you would use the old PPV in place of the original prevalence, and do Bayes again. Without updating your prior, doing a bunch of p-value based inferences is the same as applying a diagnostic test a bunch of different times without updating your believed probability that the person has the disease. This is clearly nonsense, and for me at least it helps to illustrate the error in the multiple hypothesis test setting.
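
The diagnostic-test analogy can be sketched in a few lines. The prevalence, sensitivity and specificity below are hypothetical, and the sketch assumes successive tests are independent given the patient's true status.

```python
def update(prior: float, sensitivity: float, specificity: float, positive: bool) -> float:
    """Posterior probability of disease after one test result (Bayes' rule)."""
    if positive:
        p_result_if_disease = sensitivity
        p_result_if_healthy = 1 - specificity
    else:
        p_result_if_disease = 1 - sensitivity
        p_result_if_healthy = specificity
    numerator = prior * p_result_if_disease
    return numerator / (numerator + (1 - prior) * p_result_if_healthy)

# Hypothetical numbers: 10% prevalence, 80% sensitivity, 95% specificity.
belief = 0.10
for result in [True, True, False]:
    # Each posterior becomes the prior for the next test.
    belief = update(belief, sensitivity=0.80, specificity=0.95, positive=result)
    print(f"result={'+' if result else '-'}  updated probability={belief:.3f}")
```

The point of the loop is the reuse of `belief`: resetting it to 0.10 before every test would be the "bunch of p-values without updating" mistake described above.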

Finally, seeing my name in my favorite blog has made my day. Thank you, Dr. Alexander.

> Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it?

The point of "randomizing" is to drown out factors that we don't know about. But given that we know that blood pressure is important, it's insane to throw away that information and not use it to divide the participant set.

I think the proper way to do this might be stratified sampling [1]. Divide the population into all relevant subgroups that you know about and then sample from each subgroup at the same rate to fill your two groups.

[1]: https://en.wikipedia.org/wiki/Stratified_sampling

Regarding Bayes: if a test (that is independent of the other tests) gives a Bayes factor of 1:1, then that means that the test tells you nothing. Like if you tested the Vitamin D thing by tossing a coin. It's no surprise that it doesn't change anything.

1. Did someone try to aggregate the survival expectation for both groups (patient by patient, then summed up) and control for this?

Because this is the one and main parameter.

2. Is the "previous blood pressure" strong enough a detail to explain the whole result?

3. My intuition is that this multiple comparison thing is way too dangerous an issue to go ex post and use one of the tests to explain the result.

This sounds counterintuitive. But this is exactly the garden of forking paths issue. Once you go after the fact to select numbers, your numbers are really meaningless.

Unless of course you happen to land on the 100% smoker example.


You will need a really obvious situation, rather than a maybe parameter.

Rerolling the randomization as suggested in the post doesn't usually work, because people are recruited one-by-one on a rolling basis.

But for confounders that are known a priori, one can use stratified randomization schemes, e.g. block randomization within each stratum (preferably categories, and preferably only few). There are also more fancy "dynamic" randomization schemes that minimize heterogeneity during the randomization process, but these are generally discouraged (e.g., EMA guideline on baseline covariates, Section 4.2).
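
A minimal sketch of block randomization within strata for rolling recruitment. The class name, the block size of 4 and the stratum labels are all illustrative assumptions, not taken from any guideline or trial software.

```python
import random
from collections import defaultdict

class StratifiedBlockRandomizer:
    """Assign arms one participant at a time, using shuffled blocks per stratum."""

    def __init__(self, arms=("treatment", "control"), block_size=4, seed=None):
        if block_size % len(arms) != 0:
            raise ValueError("block_size must be a multiple of the number of arms")
        self.arms = arms
        self.block_size = block_size
        self.rng = random.Random(seed)
        self.pending = defaultdict(list)  # stratum -> arms left in current block

    def assign(self, stratum):
        # Start a fresh shuffled block when the stratum's current one is used up.
        if not self.pending[stratum]:
            block = list(self.arms) * (self.block_size // len(self.arms))
            self.rng.shuffle(block)
            self.pending[stratum] = block
        return self.pending[stratum].pop()

randomizer = StratifiedBlockRandomizer(seed=0)
# Hypothetical arrivals, stratified on a single known confounder.
arrivals = ["high_bp", "normal_bp", "high_bp", "high_bp", "normal_bp", "high_bp"]
for stratum in arrivals:
    print(stratum, "->", randomizer.assign(stratum))
```

Within each stratum the arms can never differ by more than half a block, no matter when recruitment stops. The cost is that with many strata most blocks stay unfinished, which is one reason to keep the strata few.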

In my naive understanding, spurious effects due to group imbalance are part of the game, that is, included in the alpha = 5% of false positive findings that one will obtain in the null hypothesis testing model (for practical purposes, it's actually only 2.5% because of two-sided testing).

But one can always run sensitivity analyses with a different set of covariates, and the authors seem to have done this anyway.

FYI, you can next-to-guarantee that the treatment and control groups will be balanced across all relevant factors by using blocking or by using pair-matching. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4318754/#:~:text=In%20randomized%20trials%2C%20pair%2Dmatching,best%20n%2F2%20matched%20pairs.

I think controlling for noise issues with regression is a fine solution for part 1. You can also find ways of generating random groups subject to a constraint like "each group should have similar average Vitamin D." Pair up experimental units with similar observables, and randomly assign 1 to each group (like https://en.wikipedia.org/wiki/Propensity_score_matching but with an experimental intervention afterwards).

For question 2, isn't this what https://en.wikipedia.org/wiki/Meta-analysis is for? Given 4 confidence intervals of varying widths and locations, you either: 1. determine the measurements are likely to be capturing different effects, and can't really be combined; or 2. generate a narrower confidence interval that summarizes all the data. I think something like random-effects meta-analysis answers the question you are asking.

You shouldn't just mindlessly adjust for multiple comparisons by dividing the significance threshold by the number of tests. This Bonferroni adjustment is used to "control the familywise error rate" (FWER), which is the probability of rejecting at least one hypothesis, given that they are all true null hypotheses. Are you sure that is what you want to control for in your ambidextrous analysis? It's not obvious that is what you want.

My former employer Medidata offers software-as-a-service (https://www.medidata.com/en/clinical-trial-products/clinical-data-management/rtsm) that lets you ensure that any variable you thought of in advance gets evenly distributed during randomization. The industry term is https://en.wikipedia.org/wiki/Stratification_(clinical_trials)

By the way: Thanks for mentioning the "digit ratio" among other scientifically equally relevant predictors such as amount of ice hockey played, number of nose hairs, eye color, percent who own Golden Retrievers.

Made my day <3

The easy explanation here is that the number of people randomized was so small that there was no hope of getting a meaningful difference. Remember, the likelihood of adverse outcome of COVID is well below 10% - so we're talking about 2-3 people in one group vs 4-5 in the other. In designing a trial of this sort, it's necessary to power it based on the number of expected events rather than the total number of participants.

I think you would want to construct a latent construct out of your questions that measures 'authoritarianism', and then conduct a single test on that latent measure. Perhaps using a factor analysis or similar to try to divine the linear latent factors that exist, and perusing them manually (without looking at correlation to your response variable, just internal correlation) to see which one seems most authoritarianish. And then finally measuring the relationship of that latent construct to your response variable, in this case, ambidexterity.

I am now extremely confused (as distinct from my normal state of mildly confused), because I looked up the effects of Vitamin D on blood pressure.

According to a few articles and studies from cursory Googling, vitamin D supplementation will:

(1) Maybe reduce your blood pressure. Or maybe not. It's complicated https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5356990/

(2) Won't have any effect on your blood pressure but may line your blood vessels with calcium like a furred kettle, so you shouldn't take it. Okay, it's kinda helpful for women against osteoporosis, but nothing doing for men https://health.clevelandclinic.org/high-blood-pressure-dont-take-vitamin-d-for-it-video/

(3) Have no effect https://www.ahajournals.org/doi/10.1161/01.hyp.0000182662.82666.37

This Chinese study has me flummoxed - are they saying "vitamin D has no effect on blood pressure UNLESS you are deficient, over 50, obese and have high blood pressure"?


"Oral vitamin D3 has no significant effect on blood pressure in people with vitamin D deficiency. It reduces systolic blood pressure in people with vitamin D deficiency that was older than 50 years old or obese. It reduces systolic blood pressure and diastolic pressure in people with both vitamin D deficiency and hypertension."

You want to use 4 different questions from your survey to test a single hypothesis. I *think* the classical frequentist approach here would be to use Fisher's method, which tells you how to munge your p values into a single combined p: https://en.wikipedia.org/wiki/Fisher%27s_method

Fisher's method makes the fairly strong assumption that your 4 tests are independent. If this assumption is violated you may end up rejecting the null too often. A simpler approach that can avoid this assumption might be to Z-score each of your 4 survey questions and then sum the 4 Z-scores for each survey respondent. You can then just do a regular t-test comparing the mean sum-of-Z-scores between the two groups. This should have the desired effect (i.e. an increase in the power of your test by combining the info in all 4 questions) without invoking any hairy statistics.
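
For reference, Fisher's method itself is short enough to sketch with the standard library only. The statistic is -2 times the sum of the log p-values, compared against a chi-square distribution with 2k degrees of freedom; for even degrees of freedom the tail probability has a closed form, which the sketch uses. The four inputs are the p-values quoted elsewhere in this thread, used purely as an illustration, and the independence caveat above still applies to them.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # Under the null, statistic ~ chi-square with 2k degrees of freedom.
    # For even degrees of freedom the survival function has a closed form.
    half = statistic / 2.0
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return statistic, tail

# Illustrative inputs: four p-values quoted elsewhere in this thread.
statistic, combined_p = fisher_combined_p([0.049, 0.008, 0.48, 0.052])
print(f"chi-square = {statistic:.2f} on 8 df, combined p = {combined_p:.4f}")
```

A quick sanity check on the formula: with a single p-value the combined result is that p-value unchanged.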

Btw, these folks are willing to bet $100k that Vitamin D significantly reduces ICU admissions. https://blog.rootclaim.com/treating-covid-19-with-vitamin-d-100000-challenge/

A few people seem to have picked up on some of the key issues here, but I'll reiterate.

1. The study should have randomized with constraints to match blood pressures between the groups. This is well established methodology.

2. Much of the key tension between the different examples is really about whether the tests are independent. Bayesianism, for example, is just a red herring here.

Consider trying the same intervention at 10 different hospitals, and all of them individually have an outcome of p=0.07 +/- 0.02 for the intervention to "work". In spite of several hospitals not meeting a significance threshold, that is very strong evidence that it does, in fact, work, and there are good statistical ways to handle this (e.g. regression over pooled data with a main effect and a hospital effect, or a multilevel model etc.). Tests that are highly correlated reinforce each other, and modeled correctly, that is what you see statistically. The analysis will give a credible interval or p-value or whatever you like that is much stronger than the p≈0.07 results on the individual hospitals.

On the other hand, experiments that are independent do not reinforce each other. If you test 20 completely unrelated treatments, and one comes up p=0.05, you should be suspicious indeed. This is the setting of most multiple comparisons techniques.

Things are tougher in the intermediate case. In general, I like to try to use methods that directly model the correlations between treatments, but this isn't always trivial.

On the 'what could go wrong' point: what could go wrong is that in ensuring that your observable characteristics are nicely balanced, you've imported an assumption about their relation to characteristics that you cannot observe. You're saying that the subset of possible draws in which observable characteristics are balanced is also the subset where unobservable characteristics are balanced, which is way stronger than your traditional conditional independence assumption.

I think I understand the problem with your approach to combining Bayes factors. You can only multiply them like that if they are conditionally independent (see https://en.wikipedia.org/wiki/Conditional_independence).

In this case, you're looking for P(E1 and E2|H), where E1, E2 are the results of your two experiments and H is your hypothesis.

Now, generally P(E1 and E2|H) != P(E1|H) * P(E2|H).

If you knew e.g. P(E1 | E2, H), you could calculate P(E1 and E2|H) = P(E1 | E2, H) * P(E2|H).
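
A toy illustration of the difference, using the most extreme kind of dependence: E1 is just E2 re-run on the same data, so once E2 is known, E1 is certain under either hypothesis. All probabilities here are made up for the example.

```python
# Hypothetical likelihoods of the result E2 under H and under not-H.
p_e2_given_h = 0.10
p_e2_given_not_h = 0.05

# Extreme dependence: E1 repeats E2 exactly, so P(E1 | E2, .) = 1 either way.
p_e1_given_e2_and_h = 1.0
p_e1_given_e2_and_not_h = 1.0

single_factor = p_e2_given_h / p_e2_given_not_h

# Naive approach: pretend E1 is independent evidence and multiply the factors.
naive_factor = single_factor * single_factor

# Chain rule: P(E1 and E2 | H) = P(E1 | E2, H) * P(E2 | H).
joint_h = p_e1_given_e2_and_h * p_e2_given_h
joint_not_h = p_e1_given_e2_and_not_h * p_e2_given_not_h
correct_factor = joint_h / joint_not_h

print(f"single: {single_factor:.1f}, naive: {naive_factor:.1f}, chain rule: {correct_factor:.1f}")
```

The naive product doubles the apparent evidence, while the chain-rule version leaves it where a single experiment put it. Real cases fall somewhere in between, and the hard part is estimating P(E1 | E2, H).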

I think that some of the confusion here is on the difference between the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). The first is the probability under the null hypothesis of getting at least one false positive. The second is the expected proportion of false positives over all positives. Assuming some level of independent noise, we should expect the FWER to increase with the number of tests if the power of our test is kept constant (this is not difficult to prove in a number of settings using standard concentration inequalities). The FDR, however, we can better control. Intuitively this is because one "bad" apple will not ruin the barrel as the expectation will be relatively insensitive to a single test as the number of tests gets large.

Contrary to what Scott claims in the post, Holm-Bonferroni does *not* require independence of tests because its proof is a simple union bound. I saw in another comment that someone mentioned the Benjamini-Hochberg rule as an alternative. This *does* (sometimes) require independent tests (more on this below) and bounds the FDR instead of the FWER. One could use the Benjamini-Yekutieli rule (http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_yekutieli_ANNSTAT2001.pdf) that again bounds the FDR but does *not* require independent tests. In this case, however, this is likely not powerful enough as it bounds the FDR in general, even in the presence of negatively correlated hypotheses.

To expand on the Benjamini-Hochberg test, we actually do not need independence: a positive-dependence condition in the paper I linked above suffices (actually, a weaker condition of Lehmann suffices). Thus we *can* apply Benjamini-Hochberg to Scott's example, assuming that we have this positive dependency. Thus, suppose that we have 99 tests with a p-value of .04 and one with a p-value of .06. Then applying Benjamini-Hochberg would tell us that we can reject the 99 tests with p = .04 with a FDR bounded by .05 (the largest rank k with p ≤ .05k/100 is k = 99). This seems to match Scott's (and my) intuition that trying to test the same hypothesis in multiple different ways should not hurt our ability to measure an effect.

The answer to this is Regularized Regression with Poststratification: see here -- https://statmodeling.stat.columbia.edu/2018/05/19/regularized-prediction-poststratification-generalization-mister-p/

I work at a big corp which makes money by showing ads. We run thousands of A/B tests yearly, and here are our standard ways to deal with such problems:

1. If you suspect beforehand that there will be multiple significant hypotheses, run the experiment with two control groups, i.e. an AAB experiment. Then disregard all alternatives which aren't significant in both comparisons, A1B and A2B.

2. If you run an AB experiment and have multiple significant hypotheses, rerun the experiment and only pay attention to the hypotheses which were significant in the previous experiment.

I am not a statistician, so I'm unsure if it's formally correct.

>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant. But I can construct another common-sensically significant version that it wouldn't find significant - in fact, I think all you need to do is have ninety-nine experiments come back p = 0.04 and one come back 0.06.

About the Holm-Bonferroni method:

How it works is that you order the p-values from smallest to largest, and then compute a threshold for significance for each position in the ranking. The threshold formula is: α / (number of tests – rank + 1), where α is typically 0.05.

Then the p-values are compared to the threshold, in order. If the p-value is less than the threshold the null hypothesis is rejected. As soon as one is above the threshold, that one, and all subsequent p-values in the list, fail to reject the null hypothesis.

So for your example of 100 tests where one is 0.06 and others are all 0.04, it would come out to:

Rank   p      Threshold
1      0.04   0.05 / 100 = 0.0005
2      0.04   0.05 / 99 = 0.00051
...
100    0.06   0.05 / 1 = 0.05

So you're right, none of those would be considered "significant". But you'd have to be in some pretty weird circumstances to have almost all your p-values be 0.04.
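
The procedure as described above fits in a few lines. This is a sketch following that description (strictly-less-than comparison, stop at the first failure), not a reference implementation. The second list of p-values is hypothetical, added only to show a case where something does get rejected.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        threshold = alpha / (m - rank + 1)
        if p_values[idx] >= threshold:
            break  # this test and every later one fail to reject
        rejected[idx] = True
    return rejected

scott_example = holm_bonferroni([0.04] * 99 + [0.06])
toy_example = holm_bonferroni([0.0001, 0.0004, 0.03, 0.2])  # hypothetical p-values
print(sum(scott_example), sum(toy_example))
```

In the 99-plus-1 example the very first comparison (0.04 against 0.0005) already fails, so the loop stops immediately and nothing is rejected.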

"by analogy, suppose you were studying whether exercise prevented lung cancer. You tried very hard to randomize your two groups, but it turned out by freak coincidence the "exercise" group was 100% nonsmokers, and the "no exercise" group was 100% smokers."

But don't the traditionalists say that this is a feature, not a bug, of randomization? That if unlikely patterns appear through random distribution this is merely mirroring the potential for such seemingly nonrandom grouping in real life? I mean this is obviously a very extreme example for argumentative purposes, but I've heard people who are really informed about statistics (unlike me) say that when you get unexpected patterns from genuine randomization, hey, that's randomization.

About p thresholds, you've pretty much nailed it by saying that simple division works only if the tests are independent. And that is pretty much the same reason why the 1:1 likelihood ratio can't be simply multiplied by the others and give the posterior odds. This works only if the evidence you get from the different questions is independent (see Jaynes's PT:LoS chap. 4 for reference)

re: Should they reroll their randomization

What if there was a standard that after you randomize, you try to predict as well as possible which group is more likely to naturally perform better, and then you make *that* the treatment group? Still flawed, but feels like a way to avoid multiple rolls while also minimizing the chance of a false positive (of course assuming avoiding false positives is more important than avoiding false negatives).

The Bonferroni correction has always bugged me philosophically for reasons vaguely similar to all this. Merely reslicing and dicing the data ought not, I don’t think, ruin the chances for any one result to be significant. But then again I believe what the p-hacking people all complain about, so maybe we should just agree that 1/20 chance is too common to be treated as unimpeachable scientific truth!

Hm. I think to avoid the first problem, divide your sample into four groups, and do the experiment twice, two groups at a time. If you check for 100 confounders, you get an average of 5 in each group, but an average 0.25 confounders in both, so with any luck you can get the statistics to tell you whether any of the confounders made a difference (although if you didn't increase N you might not have a large enough experiment in each half).

When Scott first mentioned the Cordoba study (https://astralcodexten.substack.com/p/covidvitamin-d-much-more-than-you) I commented (https://astralcodexten.substack.com/p/covidvitamin-d-much-more-than-you#comment-1279684) that it seemed suspect because some of the authors of that study, Gomez and Bouillon, were also involved in another Spanish Vitamin D study several months later that had major randomization issues (see https://twitter.com/fperrywilson/status/1360944814271979523?s=20 for explanation). Now, The Lancet has removed the later study and is investigating it: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3771318.

The Bayesian analysis only really works if you have multiple *independent* tests of the same hypothesis. For example, if you ran your four tests using different survey populations it might be reasonable. However, since you used the same survey results you should expect the results to be correlated and thus multiplying the confidence levels is invalid.

As an extreme version of this, suppose that you just ran the *same* test 10 times, and got a 2:1 confidence level. You cannot conclude that this is actually 1000:1 just by repeating the same numbers 10 times.

Combining data from multiple experiments testing the same hypothesis to generate a more powerful test is a standard meta-analysis. Assuming the experiments are sufficiently similar you effectively paste all the data sets together, and then calculate the p value for the combined data set. With a bit of maths you can do this using only the effect sizes (d), and standard errors (s) from each experiment (you create 1/s^2 copies of d for each experiment and then run the p value). The reason none of the other commenters have suggested this as a solution to your ambidexterity problem (being ambidextrous isn't a problem! Ahem, anyway) is that you haven't given enough data - just the p values instead of the effect sizes and standard errors. I tried to get this from the link you give on the ambidexterity post to the survey, but it gave me separate pretty graphs instead of linked data I could do the analysis on. However, I can still help you by making an assumption: Assuming that the standard error across all 4 questions is the same (s), since they come from the same questionnaire and therefore likely had similar numbers of responders, we can convert the p values into effect sizes - the differences (d) using the inverse normal function:

1. p = 0.049 => d = 1.65s

2. p = 0.008 => d = 2.41s

3. p = 0.48 => d = 0.05s

4. p = 0.052 => d = 1.63s

We then average these to get the combined effect size d_c = 1.43s. However all the extra data has reduced the standard error of our combined data set. We're assuming all experiments had the same error, so this is just like taking an average of 4 data points from the same distribution - i.e. the standard error is divided by sqrt(n). In this case we have 4 experiments, so the combined error (s_c) is half what we started with. Our combined measure is therefore 1.43s/0.5s = 2.86 standard deviations above our mean => combined p value of 0.002

Now this whole analysis depends on us asserting that these 4 experiments are basically repeats of the same experiment. Under that assumption, you should be very sure of your effect!

If the different experiments had different errors we would create a weighted average with the lower error experiments getting higher weightings (1/s^2 - the inverse of the squared standard error as this gets us back to the number of data points that went into that experiment that we're pasting into our 'giant meta data set'), and similarly create a weighted average standard error (sqrt(1/sum(1/s^2))).
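
The arithmetic above can be reproduced with the standard library. This is a sketch under the same assumptions as the comment: one-sided p-values, equal standard errors, and the four results treated as independent repeats (which results from the same survey may well not be). The function names are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def combine_equal_error(p_values):
    """Average one-sided z-scores and rescale by the reduced standard error."""
    z_scores = [std_normal.inv_cdf(1 - p) for p in p_values]
    mean_effect = sum(z_scores) / len(z_scores)   # in units of s
    combined_se = 1 / sqrt(len(z_scores))         # also in units of s
    combined_z = mean_effect / combined_se
    return z_scores, combined_z, 1 - std_normal.cdf(combined_z)

def inverse_variance_pool(effects, std_errors):
    """Weighted mean with weights 1/s^2, and its standard error."""
    weights = [1 / s ** 2 for s in std_errors]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))

z_scores, combined_z, combined_p = combine_equal_error([0.049, 0.008, 0.48, 0.052])
print([round(z, 2) for z in z_scores], round(combined_z, 2), round(combined_p, 4))
```

`inverse_variance_pool` covers the unequal-error case from the last paragraph; with equal errors it reduces to the plain average with the error divided by sqrt(n).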

I wouldn't worry too much about the vitamin D study.

For one thing, it is perfectly statistically correct to run the study without doing all of these extra tests, and as you point out, throwing out the study if any of these tests comes back with p-value less than 0.05 would basically mean the study can never be done.

Part of the point of randomizing is that you would expect any confounding factors to average out. And sure, you got unlucky and people with high blood pressure ended up unevenly distributed. On the other hand, Previous lung disease and Previous cardiovascular disease ended up pretty unbalanced in the opposite direction (p very close to 1). If you run these numbers over many different possible confounders, you'd expect that the effects should roughly average out.

I feel like this kind of analysis is only really useful to

A) Make sure that your randomization isn't broken somehow.

B) If the study is later proved to be wrong, it helps you investigate why it might have been wrong.


"I think the problem is that these corrections are for independent hypotheses, and I'm talking about testing the same hypothesis multiple ways (where each way adds some noise). "

If you are testing the same hypothesis multiple ways, then your four variables should be correlated. In this case you can extract one composite variable from these four with a factor analysis et voilà, just one test to perform!


"Maybe I need a real hypothesis, like "there will be a difference of 5%", and then compare how that vs. the null does on each test? But now we're getting a lot more complicated than just the "call your NHST result a Bayes factor, it'll be fine!" I was promised."

Ideally what you need is a prior distribution of possible hypotheses. So for example you might think before any analysis of the data that there is a 90% chance of no effect and if there is an effect you expect the size to be normally distributed about 0 with a SD of 1. Then your prior distribution on the effect size x is f(x)= .9*delta(0)+.1*N(0,1) and then if p(x) is the probability of your observation given an effect size of x you can calculate the posterior distribution of effect size as f(x)*p(x)/int(f(x)*p(x)dx)
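As a rough sketch of that calculation in Python: if we additionally assume the observation y is a normally distributed estimate of the effect with known standard error se (an assumption not stated above), the integral has a closed form, because under the N(0,1) part of the prior y is marginally N(0, 1 + se^2). The function name and parameters are made up for illustration.

```python
from statistics import NormalDist


def posterior_null_probability(y, se, prior_null=0.9, slab_sd=1.0):
    """Posterior probability of 'no effect' under the prior
    prior_null * delta(0) + (1 - prior_null) * N(0, slab_sd^2),
    given an observed estimate y with normal noise of standard deviation se."""
    # Likelihood of y if the true effect is exactly 0.
    like_null = NormalDist(0.0, se).pdf(y)
    # Marginal likelihood of y if the effect is drawn from the N(0, slab_sd^2) part.
    like_alt = NormalDist(0.0, (slab_sd ** 2 + se ** 2) ** 0.5).pdf(y)
    numerator = prior_null * like_null
    return numerator / (numerator + (1.0 - prior_null) * like_alt)


# Hypothetical observations: one right at zero, one six standard errors away.
for y in (0.0, 1.0, 3.0):
    print(y, round(posterior_null_probability(y, se=0.5), 4))
```

An observation near zero pushes the posterior weight on "no effect" above the 90% prior, while one far from zero drives it towards zero.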


If the express goal of the "randomization" is the resulting groups being almost equal on every conceivable axis and any deviation from equal easily invalidates any conclusion you might want to draw... is randomization maybe the worst tool to use here?

Would a constraint solving algorithm not fare much, much better, getting fed the data and then dividing them 10'000 different ways to then randomize within the few that are closest to equal on every axis?

I hear your cries: malpractice, (un-)conscious tampering, incomplete data, bugs, ... But a selection process that reliably kills a significant portion of all research despite best efforts is hugely wasteful.

How large is that portion? Cut each bell curve (one per axis) down to the acceptable peak in the middle (= equal enough along that axis) and take that to the power of the number of axes. That's a lot of avoidably potentially useless studies!
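A minimal Python sketch of the "divide them many ways, then randomize among the most balanced" idea. It uses brute-force random search rather than a real constraint solver, and the imbalance score (largest difference in covariate means between arms) is just one possible choice; all names and the patient data are hypothetical.

```python
import random


def imbalance(assignment, covariates):
    """Largest absolute difference in covariate means between the two arms."""
    worst = 0.0
    for j in range(len(covariates[0])):
        treated = [row[j] for row, a in zip(covariates, assignment) if a == 1]
        control = [row[j] for row, a in zip(covariates, assignment) if a == 0]
        diff = abs(sum(treated) / len(treated) - sum(control) / len(control))
        worst = max(worst, diff)
    return worst


def balanced_assignment(covariates, n_candidates=10_000, keep=10, seed=0):
    """Draw many random half/half splits, keep the `keep` most balanced,
    then pick one of those at random. Returns (imbalance score, assignment)."""
    rng = random.Random(seed)
    n = len(covariates)
    base = [1] * (n // 2) + [0] * (n - n // 2)
    candidates = []
    for _ in range(n_candidates):
        candidate = base[:]
        rng.shuffle(candidate)
        candidates.append((imbalance(candidate, covariates), candidate))
    candidates.sort(key=lambda pair: pair[0])
    return rng.choice(candidates[:keep])


# Hypothetical data: 40 patients, 3 binary covariates (e.g. comorbidities).
data_rng = random.Random(1)
covariates = [[data_rng.randint(0, 1) for _ in range(3)] for _ in range(40)]
score, assignment = balanced_assignment(covariates, n_candidates=2000)
print("worst covariate imbalance:", round(score, 3))
```

Covariates on different scales would need standardizing before their differences are compared; that step is left out here.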



This paper discusses more or less this issue with potentially imbalanced randomization from a Bayesian decision theory perspective. The key point is that, for a researcher that wants to minimize expected loss (as in, you get payoff of 0 if you draw the right conclusion, and -1 if you draw the wrong one), there is, in general, one optimal allocation of units to treatment and control. Randomization only guarantees groups are identical in expected value ex-ante, not ex-post.

You don't want to do your randomization and find that you are in the 1% state of world where there is wild imbalance. Like you said, if you keep re-doing your randomization until everything looks balanced, that's not quite random after all. This paper says you should bite the bullet and just find the best possible treatment assignment based on characteristics you can observe and some prior about how they relate to the outcome you care about. Once you have the best possible balance between two groups, you can flip a coin to decide what is the control and what is the treatment. What matters is not that the assignment was random per se, but that it is unrelated to potential outcomes.

I don't know a lot about the details of medical RCTs, but in economics it's quite common to do stratified randomization, where you separate units in groups with similar characteristics and randomize within the strata. This is essentially taking that to the logical conclusion.


1. P-values should not be presented for baseline variables in an RCT. It is just plain illogical. We are randomly drawing two groups from the same population. How could there be a systematic difference? In the words of Doug Altman: ”Performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance.”

In this specific study however I'm not all that sure that the randomization really was that random. The d-vit crowd in Spain seems to be up to some shady stuff: https://retractionwatch.com/2021/02/19/widely-shared-vitamin-d-covid-19-preprint-removed-from-lancet-server/. My 5c: https://twitter.com/jaralaus/status/1303666136261832707?s=21

Regardless it is wrong to use p-values to choose what variables to use in the model. It is really very straightforward, just decide a priori what variables are known (or highly likely) predictors and put them in the model. Stephen Senn: ”Identify useful prognostic covariates before unblinding the data. Say you will adjust for them in the statistical analysis plan. Then do so.” (https://www.appliedclinicaltrialsonline.com/view/well-adjusted-statistician-analysis-covariance-explained)

2. As others have pointed out, Bayes factors will quantify the evidence provided by the data for two competing hypotheses. Commonly a point nil null hypothesis and a directional alternative hypothesis (”the effect is larger/smaller than 0”). A ”negative” result would not be 1:1, that is a formidably inconclusive result. Negative would be eg 1:10 vs positive 10:1.


Regarding part 2, when your tests are positively correlated, there are clever tricks you can do with permutation tests. You resample the data, randomly permute your dependent variable, and run your tests, and collect the p values. If you do this many times, you get a distribution of the p values under the null hypothesis. You can then compare your actual p values to this. Typically, for multiple tests, you use the maximum of a set of statistics. The reference is Westfall, P.H. and Young, S.S., 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment (Vol. 279). John Wiley & Sons.
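Here is a rough Python sketch of the single-step "max statistic" version of this idea for a two-group comparison on several correlated outcomes. It permutes the group labels (equivalent to permuting the outcomes relative to the labels) and uses an absolute Welch-style t statistic; both choices, the function names, and the simulated data are assumptions for illustration, not taken from Westfall and Young.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _var(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def t_stat(x, y):
    """Absolute Welch-style t statistic for two samples."""
    se = (_var(x) / len(x) + _var(y) / len(y)) ** 0.5
    return abs(_mean(x) - _mean(y)) / se


def max_t_adjusted_p(outcomes, labels, n_perm=2000, seed=0):
    """Adjusted p-value per outcome: share of label permutations whose
    largest statistic (over all outcomes) reaches the observed statistic."""
    rng = random.Random(seed)
    n_out = len(outcomes[0])

    def stats(lab):
        result = []
        for j in range(n_out):
            x = [row[j] for row, g in zip(outcomes, lab) if g == 1]
            y = [row[j] for row, g in zip(outcomes, lab) if g == 0]
            result.append(t_stat(x, y))
        return result

    observed = stats(labels)
    exceed = [0] * n_out
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        max_stat = max(stats(shuffled))
        for j in range(n_out):
            if max_stat >= observed[j]:
                exceed[j] += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return [(c + 1) / (n_perm + 1) for c in exceed]


# Simulated data: 4 noisy measures of one latent trait that differs by group.
data_rng = random.Random(1)
labels = [1] * 30 + [0] * 30
outcomes = []
for g in labels:
    latent = data_rng.gauss(0.8 * g, 1.0)
    outcomes.append([latent + data_rng.gauss(0.0, 1.0) for _ in range(4)])

adjusted = max_t_adjusted_p(outcomes, labels)
print([round(p, 3) for p in adjusted])
```

Because each permutation keeps every subject's outcomes together, the correlation between outcomes is preserved, which is what makes this less conservative than splitting alpha equally.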


You can't run a z-test and treat "fail to reject the null" as the same as "both are equivalent". The problem with that study is that they didn't have nearly enough power to say that the groups were "not significantly different", and relying on a statistical cutoff for each of those comparisons instead of thinking through the actual problems makes this paper useless, in my view.

Worst example: If you have 8 total patients with diabetes (5 in the smaller group and 3 in the larger) you are saying that a 3.5 fold difference in diabetes incidence rate (accounting for the group sizes) is not significant. Obviously that is the kind of thing that can only happen if you are relying on p-values to do your thinking for you, as no reasonable person would consider those groups equivalent. It's extra problematic because, coincidentally, all of those errors happened to point in the same direction (more comorbidities in the control group). This is more of the "shit happens" category of problem though, rather than a flaw in the study design.

There are lots of ways they could have accounted for differences in the study populations, which I expect they tried and it erased the effect. Regressing on a single variable (blood pressure) doesn't count... you would need to include all of them in your regression. The only reason this passed review IMO is because it was "randomized" and ideally that should take care of this kind of issue. But for this study (and many small studies) randomization won't be enough.

I put this paper in the same category of papers that created the replication crisis in social psychology: "following all the rules" and pretending that any effect you find is meaningful as long as it crosses (or in this case, doesn't cross) the magical p=0.05 threshold. The best response to such a paper should be something like "well that is almost certainly nothing, but I'd be slightly more interested in seeing a follow-up study"


You need to specify in advance which characteristics you care about ensuring proper randomization and then do block randomization on those characteristics.

There are limits to the number of characteristics you can do this for, with a given sample size / effect size power calculation.

This logic actually says that you *can* re-roll the randomization if you get a bad one --- in fact, it says you *must* do this, that certain "randomizations" are much better than other ones because they ensure balance on characteristics that you're sure you want balance on.


>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant.

Unfortunately, the Holm-Bonferroni method doesn't work that way. It always requires that one of the p-values at least be significant in the multiple comparisons sense; so at least one p-value less than 0.05/n, where n is the number of comparisons.

Its strength is that it doesn't require all the p-values to be that low: the k-th smallest p-value only has to be below 0.05/(n-k+1). So if you have five p-values: 0.009, 0.012, 0.016, 0.024, 0.049, then all five are significant given Holm-Bonferroni, whereas only one would be significant for the standard multiple comparisons (Bonferroni) test.
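A short Python sketch of the step-down procedure next to plain Bonferroni may make the difference concrete. The p-values are hypothetical, and the function names are made up for this example.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: True where the hypothesis is rejected."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, i in enumerate(order):
        # The smallest p-value faces alpha/n, the next alpha/(n-1), and so on.
        if p_values[i] <= alpha / (n - rank):
            reject[i] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


def bonferroni_reject(p_values, alpha=0.05):
    """Plain Bonferroni: every p-value faces the same alpha/n threshold."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


graded = [0.009, 0.012, 0.016, 0.024, 0.049]  # hypothetical p-values
replicated = [0.04] * 10                      # ten tests that all give p = 0.04

print("Holm, graded:      ", holm_reject(graded))
print("Bonferroni, graded:", bonferroni_reject(graded))
print("Holm, replicated:  ", holm_reject(replicated))
```

The last call illustrates the first point: with ten results at p = 0.04, none clears the initial 0.05/10 threshold, so the procedure stops immediately and rejects nothing.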


I don't claim to have any serious statistical knowledge here, but my intuitive answer is that expected evidence should be conserved.

If you believe that vitamin D reduces COVID deaths, you should expect to see a reduction in deaths in the overall group. It should be statistically significant, but that's effectively a way of saying that you should be pretty sure you're really seeing it.

If you expect there to be an overall difference, then either you expect that you should see it in most ways you could slice up the data, or you expect that the effect will be not-clearly-present in some groups but very-clearly-present in others, so that there's a clear effect overall. I think the latter case means _something_ like "some subgroups will not be significant at p < .05, but other subgroups will be significant at p < (.05 / number of subgroups)". If you pick subgroups _after_ seeing the data, your statistical analysis no longer reflects the expectation you had before doing the test.

For questions like "does ambidexterity reduce authoritarianism", you're not picking a single metric and dividing it across groups - you're picking different ways to operationalize a vague hypothesis and looking at each of them on the same group. But I think that the logic here is basically the same: if your hypothesis is about an effect on "authoritarianism", and you think that all the things you're measuring stem from or are aspects of "authoritarianism", you should either expect that you'll see an effect on each one (e.g. p = .04 on each of four measures), or that one of them will show a strong enough effect that you'll still be right about the overall impact (e.g. p = .01 on one of four measures).

For people who are giving more mathematical answers: does this intuitive description match the logic of the statistical techniques?


"Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it? But it seems harsh to force them to perform a study even though we know we'll dismiss the results as soon as we get them. If we made them check a pre-written list of confounders and re-roll until there were no significant differences on any of them, what could go wrong? I don't have a good answer to this question, but thinking about it still creeps me out."

The orthodox solution here is stratified random sampling, and it is fairly similar. For example, you might have a list of 2,500 men and 2,500 women (assume no NBs). You want to sample 500, 250 for control and 250 for intervention, and you expect that gender might be a confound. In stratified random sampling, instead of sampling 250 of each and shrugging if there's a gender difference, you choose 125 men for the control and 125 men for the intervention, and do the same for women. This way you are certain to get a balanced sample (https://www.investopedia.com/terms/stratified_random_sampling.asp, for example). While this only works with categorical data, you can always just bin continuous data until it cooperates.

The procedure is statistically sound, well-validated, and commonly practiced.
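For concreteness, a minimal Python sketch of the men/women example. The unit representation (a tuple of gender and index) and the function name are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_assignment(units, stratum_of, seed=0):
    """Within each stratum, randomly send half the units to each arm."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        half = len(members) // 2
        for unit in members[:half]:
            assignment[unit] = "intervention"
        for unit in members[half:]:
            assignment[unit] = "control"
    return assignment


# 2,500 men and 2,500 women; sample 250 of each, then split each 125/125.
sample_rng = random.Random(42)
men = [("M", i) for i in range(2500)]
women = [("F", i) for i in range(2500)]
sample = sample_rng.sample(men, 250) + sample_rng.sample(women, 250)

assignment = stratified_assignment(sample, stratum_of=lambda unit: unit[0])
counts = Counter((unit[0], arm) for unit, arm in assignment.items())
print(counts)
```

A continuous variable would first be binned (for example into quartiles), with the bin label returned by `stratum_of`.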


>I chose four questions that I thought were related to authoritarianism

What you can do if you're trying to measure a single thing (authoritarianism) and have multiple proxies, is to average the proxies to get a single measure, then calculate the p-value using the average measure only. I'd recommend doing that now (even though, ideally, you'd have committed to doing that ahead of time).


For the Vitamin D study, I think you are off track, as were the investigators. Trying to assess the randomness of assignment ex post is pretty unhelpful. The proper approach is to identify important confounders ahead of time and use blocking. For example, suppose that blood pressure is an important confounder. If your total n is 160, you might divide those 160 into four blocks of 40 each, based on blood pressure range. A very high block, a high block, a low block, and a very low block. Then randomly assign half within each block to get the treatment, so 20 very high blood pressure folks get the vitamin D and 20 do not. That way you know that blood pressure variation across subjects won't mess up the study. If there are other important confounders, you have to subdivide further. If there are so many important confounders that the blocks get to be too small to have any power, then you needed a much bigger sample in the first place.


And, of course, the larger the sample, the less you have to use advanced statistics to detect an effect.

Another alternative: replicate your results. The first test, be sloppy, don't correct for multiple comparisons, just see what the data seems to say, and formulate your hypothesis(es). Then test that/those hypothesis(es) rigorously with the second test.

You can even break your single database into two random pieces, and use one to formulate the hypothesis(es) and the other to test it/them.


Significance doesn't seem like the right test here. When you are testing for significance, you are asking a question about how likely it is that differences in the sample represent differences in the wider population (more or less, technically you are asking for frequentist statistics "if I drew a sample and this variable was random, what is the chance I would see a difference at least this large"). In this case, we don't care about that question, we care if the actual difference between two groups is large enough to cause something correlated with it to show through. At the very least, the Bonferroni adjustment doesn't apply, in fact I would go in the other direction. The difference needs to be big enough that it has some correlation with the outcome strong enough to cause a spurious result.


On section I...

The key point of randomized trials is NOT that they ensure balance of each possible covariate. The key point is that they make the combined effect of all imbalances zero in expectation, and they allow the statistician to estimate the variance of their treatment-effect estimator.

Put another way, randomization launders what would be BIAS (systematic error due to imbalance) into mere VARIANCE (random error due to imbalance). That does not make balance irrelevant, but it subtly changes why we want balance — to minimize variance, not because of worries about bias. If we have a load of imbalance after randomizing, we'll simply get a noisier treatment-effect estimate.

"If the groups are different to start with, then we won't be able to tell if the Vitamin D did anything or if it was just the pre-existing difference."

Mmmmmaybe. If the groups are different to start with, you get a noisier treatment-effect estimate, which MIGHT be so noisy that you can't reject the vitamin-D-did-nothing hypothesis. Or, if the covariates are irrelevant, they don't matter and everything turns out fine. Or, if vitamin D is JUST THAT AWESOME, the treatment effect will swamp the imbalance's net effect anyway. You can just run the statistics and see what numbers pop out at the end.

"Or to put it another way - perhaps correcting for multiple comparisons proves that nobody screwed up the randomization of this study; there wasn't malfeasance involved. But that's only of interest to the Cordoba Hospital HR department when deciding whether to fire the investigators."

No. It's of interest to us because if we decide that the randomization was defective, all bets are off; we can't trust how the study was reported and we don't know how the investigators might have (even if accidentally) put their thumb on the scale. If we instead convince ourselves that the randomization was OK and the study run as claimed, we're good to apply our usual statistical machinery for RCTs, imbalance or no.

"But this raises a bigger issue - every randomized trial will have this problem. [...] Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; [...] if you're not going to adjust these away and ignore them, don't you have to throw out every study?"

No. By not adjusting, you don't irretrievably damage your randomized trials, you just expand the standard errors of your final results.

Basically, if you really and truly think that a trial was properly randomized, all an imbalance does is bleed the trial of some of its statistical power.

See https://twitter.com/ADAlthousePhD/status/1172649236795539457 for a Twitter thread/argument with some interesting references.


Regarding whether hypertension explains away the results, have not read the papers (maybe Jungreis/Kellis did something strictly better), but here's a simple calculation that sheds some light I think:

So 11 out of the 50 treated patients had hypertension. 39 don't.

And 15 out of the 26 control patients had hypertension. 11 don't.

You know that a total of 1 vitamin D patient was ICU'd. And 13 of the control patients were admitted.

There is no way to slice this such that the treatment effect disappears completely [I say this, having not done the calculation I have in mind to check it -- will post this regardless of what comes out of the calculation, in the interest of pre-registering and all that]

To check this, let's imagine that you were doing a stratified study, where you're testing the following 2 hypotheses simultaneously:

H1: Vitamin D reduces ICU rate among hypertension patients

H2: Vitamin D reduces ICU rate among non-hypertension patients.

Your statistical procedure is to

(i) conduct a Fisher exact test on the 11 [treated, hypertension] vs 15 [control, hypertension] patients

(ii) conduct a Fisher exact test on the 39 [treated, no-hypertension] vs 11 [control, no-hypertension] patients

(iii) multiply both by 2, to get the Bonferroni-corrected p-values; accept a hypothesis if its Bonferroni-corrected p-value is < 0.05

If we go through all possible splits* of the 1 ICU treated patient into the 11+39 hypertension+non-hypertension patients and the 13 ICU control patients into the 15+11 hypertension+non-hypertension patients (there are 24 total possible splits), the worst possible split for the "Vitamin D works" camp is if 1/11 hypertension & 0/39 non-hypertension treated patients were ICU, and 10/15 hypertension & 3/11 non-hypertension control patients were ICU.

In this case still you have a significant treatment effect (the *corrected* p-values for H1 and H2 are 0.01 and 0.02 in this case).

I don't know how kosher this is formally, but it seems like a rather straightforward & conservative way to see whether the effect still stands (and not a desperate attempt to wring significance out of a small study, hopefully), and it does seem to stand.

This should also naturally factor out the direct effect that hypertension may have on ICU admission (which seems to be a big concern).

Other kinds of uncertainty might screw things up though - https://xkcd.com/2440/ -- given the numbers, I really do think there *has* to have been some issue of this kind to explain away these results.

*simple python code for this: https://pastebin.com/wCpPaxs8
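For readers who cannot open the link, here is an independent Python sketch of the same enumeration (not the pastebin code). It assumes a one-sided Fisher exact test in each stratum, since H1 and H2 are directional; the original calculation may have used a different variant. Patient counts are the ones quoted above.

```python
from math import comb


def fisher_one_sided(treated_icu, treated_n, control_icu, control_n):
    """One-sided Fisher exact p-value: chance of this few or fewer ICU
    admissions in the treated arm, given the table margins."""
    total_n = treated_n + control_n
    total_icu = treated_icu + control_icu
    lowest = max(0, total_icu - control_n)
    favourable = sum(
        comb(treated_n, k) * comb(control_n, total_icu - k)
        for k in range(lowest, treated_icu + 1)
    )
    return favourable / comb(total_n, total_icu)


TREATED = {"hyp": 11, "no_hyp": 39}  # treated patients per stratum
CONTROL = {"hyp": 15, "no_hyp": 11}  # control patients per stratum
TREATED_ICU, CONTROL_ICU = 1, 13     # ICU admissions per arm

splits = []
for t_hyp in range(TREATED_ICU + 1):
    t_no = TREATED_ICU - t_hyp
    for c_hyp in range(CONTROL_ICU + 1):
        c_no = CONTROL_ICU - c_hyp
        if c_hyp > CONTROL["hyp"] or c_no > CONTROL["no_hyp"]:
            continue  # impossible split: more ICU cases than patients
        # Bonferroni correction for two hypotheses: multiply by 2, cap at 1.
        p_h1 = min(1.0, 2 * fisher_one_sided(t_hyp, TREATED["hyp"], c_hyp, CONTROL["hyp"]))
        p_h2 = min(1.0, 2 * fisher_one_sided(t_no, TREATED["no_hyp"], c_no, CONTROL["no_hyp"]))
        splits.append(((t_hyp, t_no, c_hyp, c_no), p_h1, p_h2))

print(len(splits), "possible splits")
for split, p_h1, p_h2 in splits:
    print(split, f"H1 p = {p_h1:.4f}", f"H2 p = {p_h2:.4f}")
```

Each tuple reads (treated ICU with hypertension, treated ICU without, control ICU with hypertension, control ICU without), so the split singled out above is (1, 0, 10, 3).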


Seems like it should be relevant that sunlight, which is a cause of vitamin D, also reduces blood pressure through nitric oxide.



My suspicion is that you are reaching the limits of what is possible using statistical inference. There might be, however, alternative mathematical approaches that might provide an answer to the real question "should we prescribe X and, if so, to whom?". I refer specifically to optimisation-based robust classification / regression (see e.g. work by MIT professor Bertsimas https://www.mit.edu/~dbertsim/papers.html#MachineLearning - he's also written a book on this). But I would still worry about the sample size, it feels small to me.


For the ambidexterity question, what was your expected relation between those four questions and the hidden authoritarianism variable? Did you expect them all to move together? All move separately but the more someone got "wrong" the more authoritarian they lean? Were you expecting some of them to be strongly correlated and a few of them to be weakly correlated? All that's to ask: is one strong sub-result and three weak sub-results a "success" or a "failure" of this prediction? Without a structural theory it's hard to know what to make of any correlations.


Then, just to throw one more wrench in, suppose your background doesn't just change your views on authoritarianism, but also how likely you are to be ambidextrous.

Historically, in much of Asia and Europe teachers enforced a heavy bias against left-handedness in activities like writing. [https://en.wikipedia.org/wiki/Handedness#Negative_connotations_and_discrimination] You don't see that bias exhibited as much by Jews or Arabs [anecdotal], probably because Hebrew and Arabic are written right-to-left. But does an anti-left-handed bias decrease ambidexterity (by shifting ambidextrous people to right-handed) or increase ambidexterity (by shifting left-handed people to ambidextrous)? Does learning language with different writing directions increase ambidexterity? Is that checkable in your data?

Most of us learned to write as children [citation needed], and since most of your readership and most of the study's respondents are adults [citation needed] they may have already been exposed to this triumph of nurture over nature. It's possible that the honest responses of either population may not reflect the natural ambidexterity rate. Diving down the rabbit hole, if the populations were subject to these pressures, what does it actually tell us about the self-reported ambidextrous crowd? Are the ambidextrous kids from conservative Christian areas the kids who fell in line with an unreasonable authority figure? Who resisted falling in line? Surely that could be wrapped up in their feelings about authoritarianism.


These examples are presented as being similar, but they have an important distinction.

I agree that testing p-values for the Vitamin D example doesn't make too much sense. However, if you did want to perform this kind of broad ranging testing, I think you should be concerned with the false discovery rate rather than the overall level of the tests. Each of these tests is, in some sense, a different hypothesis, and should receive its own budget of alpha.

The second example tests a single hypothesis in multiple ways. Because it's a single hypothesis, it could make sense to control the overall size of the test at 0.025. However, because these outcomes are (presumably) highly correlated, you should use a method that adjusts for the correlation structure. Splitting the alpha equally among four tests is unnecessarily conservative.


Regarding question 2: There are various options here, the majority of which will be most effective if you (a) know what the dependence is between your test statistics, and (b) precisely specify what you want to test.

For (a): if what you’re doing here are basic two-sample t-tests, and if the sample size of ambidextrous people is “reasonably large” then by CLT your 8 sample means (2 samples x 4 responses) are ~multivariate Gaussian with a covariance matrix you can estimate accurately based on individual-level data. Your calculations — Bayesian and frequentist — should take that into account (I can explain more if this is indeed the case).

If it’s too small for CLT to be realistic, you can still do things with permutations but the hypotheses you can test are less interesting; I assume you would rather conclude “the mean is larger in group 2” (which a t-test can give you) than conclude “the distributions in the two groups are not identical” (which is what you usually get out of a permutation test).

For (b): the two main options are testing the global null hypothesis (no differences in any of the four responses vs some difference for at least one response) or testing for differences in the four individual responses. In the first case you want to compute some combined p-value (again various options but there is no universally best way to do this and it isn’t really valid to pick one after seeing the data).

In the second case a natural choice is to compute simultaneous confidence intervals for the four mean differences, calculating the correct width using the multivariate Gaussian approximation or bootstrap. I would recommend this last option: happy to provide further details on request.


"When you're testing independent hypotheses, each new test you do can only make previous tests less credible."

This doesn't make sense. If the hypotheses are independent, your view of each hypothesis should not depend on the result in the others. Your view of the effect of a new drug on cancer progression should have no relation to whether you are at the same time looking into whether gender is related to voting preference in an election. The whole idea of multiple testing correction as practiced in mainstream science seems to readily lead to absurdity and I'm not sure why this isn't more discussed (maybe it is within statistics communities I'm not part of, but mainstream scientists without statistical background seem to just accept the fundamental weirdness of the idea at face value).

How this relates to your question, people are suggesting all sorts of ways to "correct" your result but it's not clear to me why these are improvements. The data seems convincing that there is a relationship--is it more convincing if you apply some correction?


"But suppose I want to be extra sure, so I try a hundred different ways to test it, and all one hundred come back p = 0.04. Common sensically, this ought to be stronger evidence than just the single experiment; I've done a hundred different tests and they all support my hypothesis."

The problem here seems to me that you're trying to apply common sense to a topic that isn't really in the realm of common sense. This is true not only in the sense that our intuition does a poor job of estimating the effects of sample variation (and therefore seeing patterns where there are none), but more importantly in that the concept of a p-value isn't easy to parse in common sense.

For example, what do you mean by saying a p-value of 0.04 supports your hypothesis? Say you're doing a one sample two-sided t-test against a hypothesized mean: how is the p-value calculated? First (forgive me if I make some minor errors here) you calculate a test statistic T = (sample mean - hypothesized mean)/(sample standard error), then the p-value is (1 - t)*2 where t is the value of the CDF of a Student's t-distribution with (sample size - 1) degrees of freedom, evaluated at |T|... it's not immediately obvious that any of this will collapse down into something interpretable as "common sense".

What we can say is that - if the thing you're measuring really is normally distributed and has a true mean equal to the hypothesized mean, then across many samples the resulting p-values will be uniformly distributed between 0 and 1. If the true mean differs then the distribution of p-values will skew more and more towards low values (too bad the comments here don't allow pictures!).
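In lieu of pictures, a small simulation sketch in Python. To stay within the standard library it swaps the t-test for a two-sided z-test with known standard deviation, which is a simplifying assumption; the sample size, effect size, and number of simulations are arbitrary choices.

```python
import random
from statistics import NormalDist


def simulate_p_values(true_mean, hypothesized_mean=0.0, sigma=1.0,
                      n=30, n_sims=4000, seed=0):
    """Two-sided z-test p-values from repeated samples of size n."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sigma / n ** 0.5
    p_values = []
    for _ in range(n_sims):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        z = (sample_mean - hypothesized_mean) / se
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values


def fraction_below(p_values, threshold=0.05):
    return sum(p < threshold for p in p_values) / len(p_values)


null_p = simulate_p_values(true_mean=0.0)  # true mean equals hypothesized mean
alt_p = simulate_p_values(true_mean=0.5)   # true mean differs

print("share of p < 0.05 when the null is true: ", fraction_below(null_p))
print("share of p < 0.05 when the null is false:", fraction_below(alt_p))

# Crude text histogram of the null p-values in ten equal-width bins.
for b in range(10):
    count = sum(b / 10 <= p < (b + 1) / 10 for p in null_p)
    print(f"{b / 10:.1f}-{(b + 1) / 10:.1f} {'#' * (count // 20)}")
```

Under the null the ten bins come out roughly equal in height; rerunning the histogram on `alt_p` piles most of the mass into the first bin.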

This means that if there is no difference between the true and hypothesized mean, you'd expect to get p-values < 0.05, 5 % of the time. A p-value of 0.04 means that, if, hypothetically, the means weren't different, you'd only have seen as much or more difference than you actually saw, 4 % of the time. Saying that this supports your hypothesis is a dangerous leap over what's really a non-intuitive idea. The best you can say is that the effect is unlikely to be due to chance. If you walk through life declaring that p = 0.04 supported your hypothesis you'd be wrong 4 % of the time. Every new test you do *does* (in a sense) make the previous test less credible because the more tests there are the more likely at least one thing is plausibly just random noise.

If you performed 100 studies and they all got p ~ 0.04 I would be astounded and confused but only convinced that something extremely bizarre is going on. What kind of process would even produce such a result with other than astronomically unlikely odds? If your studies' data were genuinely independent you should have got a scatter of p-values - between 0 and 1 if there were no difference, and skewed towards 0 if there was. Inventing this kind of scenario seems a perfect example of how "common sense" doesn't work here.

Personally, I think there's nothing a p-value can tell you that a confidence interval can't tell you 100x better, and die a little inside whenever I see one.


"Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; if your threshold for "significant" is p < 0.05, it'll be after investigating around 20 possible confounders...So if you're not going to adjust these away and ignore them, don't you have to throw out every study? I don't think there's a formal statistical answer for this."

This is a well-known problem, going back to Fisher, for which a lot of methods have been developed. See here for a fairly in-depth (in parts technical) treatment of them co-authored by one of the top statisticians of the last 50 years working in this field: https://arxiv.org/pdf/1207.5625.pdf .


Regarding point 2, when you have multiple tests assessing the same hypothesis: why not use the average p-value? If the null hypothesis is true, you expect to obtain an average p-value equal or lower than the observed-average p-value, an observed-average p-value * 100 percent of the times.


> For example, it seems like opposition to immigration really was strongly correlated with ambidexterity, but libertarianism wasn’t. So my theory that both of these were equally good ways of measuring some latent construct “authoritarianism” was wrong. Now what?

Did you check whether opposition to immigration was correlated with libertarianism in the way you expected? As a Trump-supporting libertarian, and an anti-immigration anti-authoritarian, I must admit I felt a bit frustrated with your identification of the "authoritarianism" latent variable.

In terms of "now what", maybe it's time to go back to the drawing board and think about what "authoritarianism" actually means. Is it even a real thing, or is it just a label that people apply to things that they don't like?

Expand full comment

I wrote a paper on why using inferential statistical tests for this purpose is wrong:


To quote our abstract:

Experimental research on behavior and cognition frequently rests on stimulus or subject selection where not all characteristics can be fully controlled, even when attempting strict matching. For example, when contrasting patients to controls, variables such as intelligence or socioeconomic status are often correlated with patient status. [...] One procedure very commonly employed to control for such nuisance effects is conducting inferential tests on [...] subject characteristics. [...] Such a test has high error rates and is conceptually misguided. It reflects a common misunderstanding of statistical tests: interpreting significance not to refer to inference about a particular population parameter, but about 1. the sample in question, 2. the practical relevance of a sample difference (so that a nonsignificant test is taken to indicate evidence for the absence of relevant differences). We show inferential testing for assessing nuisance effects to be inappropriate both pragmatically and philosophically, present a survey showing its high prevalence, and briefly discuss an alternative in the form of regression including nuisance variables.

Expand full comment

Those aren't Bayes factors. The Bayes factor in favor of the Ambidextrous Authority hypothesis (AAH) would be P(you got that data|AAH) / P(you got that data|~AAH) = P(you got that data|AAH) / p-value.

You haven't figured out what you think is the probability of getting those results if your non-null hypothesis is true. That's tricky, because it means you have to nail down a concrete non-null hypothesis. That's *especially* tricky because you're looking at four different outcomes that won't necessarily fit nicely under a single model.

How much do you want this number?

Expand full comment

My spouse and I were reading Middlemarch, A Study of Provincial Life, when a friend sent us this link. https://youtu.be/KYWD45FN5zA

Did an obscure television broadcast predict Red China's plans against the West?

Expand full comment

Ok, Scott, I think this is a solved problem (under certain assumptions, that most of the multiple comparison methods like Bonferroni correction use), and furthermore, your intuition about "so I try a hundred different ways to test it, and all one hundred come back p = 0.04... this should be stronger evidence" is correct.

The classic version of the multiple comparison adjustment says "we have done K tests; given K, how likely was it that our best p-value would be smaller than some critical value? And specifically, how low a critical value alpha_adj do we need to use such that the chance that the best p-value is below alpha_adj is less than alpha?" And then to get an overall "family" type I error rate of alpha, we use the lower critical p-value for the individual test.

This is fine, but it only answers what it answers. It says, to be clear, how likely is it that our *best* p-value would be at least this small. But that throws away information on all the other p-values below the best one.

The situation you describe, with the four different experiments, testing the same hypothesis a few different ways, essentially says "hey, don't we have extra information? And doesn't it seem rather unlikely that most of the individual tests we conduct give results that, under the null, should be pretty unlikely to happen (strictly speaking, pretty unlikely that something at least this extreme happens)?" (Even if none of the p-values is extremely small, it *should* be pretty unlikely for independent tests to keep finding p = 0.05).

We can make this clear, and show the solution, with a simple example. Imagine a (possibly biased) coin, that we flip 10000 times. But we split the flips into 100 sets of 100 flips, and count the number of heads for each.

We're interested in testing the hypothesis that the coin is biased (towards heads), compared to a null of the coin being fair (Prob(Heads) = 0.5). Note that by construction, in this specific case, the null is *either* true for every set of flips or false for every set. The coin is biased, or it isn't. Your case is the same: you have 4 tests of the same underlying hypothesis, and the hypothesis is either true or it isn't. (Note the contrast with testing whether 100 different medications cure an illness - anywhere between 0 and 100 of those null hypotheses might be false. This complicates things, but I'll leave that aside for now).

We always (probably always?) construct statistical tests by comparing against the null of no effect / no difference etc. Under the null that the coin is not biased, each of our 100 experiments (of 100 coin flips each) is independent. You get 50 heads out of 100 in expectation, and experiment 73 yielding 57 heads tells you *nothing* about the distribution of outcomes for experiment 87. And so on. For each experiment you have a realisation, and a p-value associated with it (for simplicity, the one-sided p-value - our hypothesis is bias in favor of heads). Sort/rank the p-values.

The typical multiple comparisons test says "under the null of no true effect, how likely is it that the best p-value that arose from doing this procedure K (i.e. 100) times would be smaller than our best (smallest) p-value?" This can be calculated using the standard corrections (Bonferroni is a linear approximation of the correct calculation, which is Sidak's method mentioned in the linked piece).
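To make the relationship between the two corrections concrete, here is a minimal sketch, assuming K independent tests under the null. The function names are made up for illustration; the formulas are the standard Bonferroni threshold (alpha / K) and the Šidák threshold (1 - (1 - alpha)^(1/K)).

```python
def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-test threshold from the linear (Bonferroni) approximation."""
    return alpha / k


def sidak_threshold(alpha: float, k: int) -> float:
    """Per-test threshold that gives exactly alpha family-wise error
    when the k tests are independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


def familywise_error(per_test_alpha: float, k: int) -> float:
    """Chance that at least one of k independent null tests falls
    below the per-test threshold."""
    return 1.0 - (1.0 - per_test_alpha) ** k


if __name__ == "__main__":
    for k in (4, 20, 100):
        print(k, bonferroni_threshold(0.05, k), sidak_threshold(0.05, k))
```

The Šidák threshold is always slightly more lenient than Bonferroni's, and the gap is small at conventional alpha levels, which is why the linear approximation is usually considered good enough.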

But we can also say "ok, but how likely is it that if we did this procedure K times, the Nth smallest p-value would be smaller than our Nth smallest p-value?" You can do this separately for each value of N (from 1 to K). In each case, it is a fairly standard binomial probability calculation, which will look really ugly to write without LaTeX. In your example, if you have 4 experiments, it is pretty unlikely that the second worst would still have a p-value of ~0.05. So this method picks up on the point you intuitively arrive at - if you do a randomised experiment many times, it should be pretty unlikely to get borderline significant results every time. (Note that in my stylised example, we can just combine the experiments into one jumbo experiment of 10000 coin flips, and then all those p = 0.04 will become a single p = 0.000000000001, exactly matching your intuition that these low p-values should "add up" in evidentiary terms).
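The binomial calculation referred to above can be sketched as follows, assuming K independent tests whose p-values are uniform on [0, 1] under the null. The Nth smallest p-value is at most x exactly when at least N of the K p-values are at most x, which is a binomial tail. The function name is hypothetical.

```python
from math import comb


def prob_nth_smallest_at_most(x: float, n: int, k: int) -> float:
    """P(the n-th smallest of k independent null p-values is <= x).

    Equivalent to P(Binomial(k, x) >= n).
    """
    return sum(comb(k, j) * x**j * (1.0 - x) ** (k - j) for j in range(n, k + 1))


if __name__ == "__main__":
    k = 4
    for n in range(1, k + 1):
        # How surprising is it, under the null, that the n-th best
        # of 4 p-values is still at or below 0.05?
        print(n, prob_nth_smallest_at_most(0.05, n, k))
```

With n = 1 this reduces to the Šidák calculation for the best p-value; with n = K it is simply x to the power K, the chance that every test comes in at or below x.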

Another commenter mentions Holm-Bonferroni (which you then comment on). This is not the same. It (I think) ends up being a linear approximation to the correct correction in the "100 different tests for 100 different medications curing an illness" case (with a weird caveat I'll skip). Another commenter mentions Benjamini-Hochberg. I think it is basically a linearised version of the binomial probability calculation I mention above - it is close to identical for small N (relative to K, i.e. is the second best p-value demonstrative of a "real" effect) but (I think) is way too conservative for N close to K. (I want to avoid writing 500 words explaining the distinction here that no-one will read... so leaving it there)

So basically, you're correct, and it can be calculated properly.

CAVEAT: Independence - note that in my example, each of my 100 experiments was independent. In your example, you have 4 tests of a hypothesis. If those tests are independent - essentially, the information they contain doesn't overlap in any way - this method works. If the tests aren't independent, all of the most standard multiple testing correction methods are in trouble. (Because calculating joint probabilities is hard for non-independent events, you can't just multiply p(A) with p(B)).

Thinking about independence is tricky, because "if the hypothesis is true", of course all 4 tests are likely to give low p-values, and if the hypothesis is not, they are not. Isn't this a violation of independence? No. Why not? Because we construct p-values under the null of no true effect, i.e. that the hypothesis is false. So what we mean by independence is "assuming there is no true effect, would finding a low p-value on test #1 change the probability distribution of p-values (e.g. make a low p-value more likely) for test #2-4?" The coin flip example is a good example of independence. As for a simple example of non-independence, consider the following: we want to know if wearing a headband makes athletes run faster. So we turn up at the local track meet, randomly give some kids headbands, and time them in the 100m sprint. (And get some p-value). Then we *don't* reallocate the headbands, and time them in the 200m sprint. (And get some p-value). Even if the headband has no effect, if we randomly happened to give the headbands to kids who are faster (or slower) than average already, we'll get similar results (and thus p-values) in each case because people who are fast at 100m are also fast at 200m. The experiments reuse information.

Expand full comment

Many are making similar comments below, but the key takeaway is that naive randomization is dumb because we can do stratified sampling instead. Naive randomization is an artifact of the days before fast computers when a good stratification would have been hard.

Stratification is like if you did naive randomization a million times and then took the randomization instance with the best class balance and ran with it, which despite Scott's reservations is better than taking a crappy randomization that guarantees problems with interpretation.

But even with this crappy randomization they can still do propensity score matching and deal with it, up to a point.

Expand full comment

False Discovery Rate (FDR) correction (Benjamini and Hochberg) is now much more common than Bonferroni because, yeah, Bonferroni is way too conservative. FDR effectively makes your lowest p value N times higher (N is number of tests), the next best one N/2 times higher, the next one N/3 times higher and so on. In contrast Bonferroni makes them all N times higher.
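As a rough illustration of those multipliers, here is a minimal sketch of Benjamini-Hochberg adjusted p-values next to Bonferroni's. The function names are made up for this example. The one detail beyond the "N, N/2, N/3" description is the standard monotonicity step: an adjusted value is never allowed to exceed the adjusted value of a larger p-value.

```python
def bh_adjust(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni_adjust(pvalues: list[float]) -> list[float]:
    """Bonferroni adjusted p-values: every p-value times N, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


if __name__ == "__main__":
    p = [0.001, 0.04, 0.03, 0.5]
    print(bh_adjust(p))
    print(bonferroni_adjust(p))
```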

Also for both FDR and Bonferroni there are modifications for correlated tests which make the multiples smaller, so you don't destroy your significance just by doing the same test on a bunch of copies of the same data set or other correlated variables from one data set, so the "100 different ways to test it" concern isn't really a concern since you won't be multiplying by 100, even for Bonferroni.

Bonferroni without correction for correlations is really only recommended for those with p values so small they can afford to just multiply by a large number for no reason, still have significance, and thus dispense with their critics. It's the ultimate power move, pun intended.

Expand full comment

This kind of thing is super relevant when you're analyzing the results of an A/B test in a game, website, or app: you may have one target variable (revenue/retention/crash rate/etc), but in addition it's *really* crucial to make sure that your changes aren't having negative effects that you didn't expect. You can end up looking at 30 different metrics that all measure engagement for different types of users and tell different stories, and the (extremely tough and almost never well done) job of an analyst is to figure out whether the test improved anything or not.

Then the fuckin PM who specced the feature comes in, looks through the 50 metrics associated with a test, picks the one with the biggest change and plugs it into an online significance calculator to declare victory and get a promotion. C'est la vie.

This stuff is way more fiddly and subjective in the real world than stats classes make it seem. A good analyst can make the data tell almost any story they want, and can do so without making it clear that they cheated, and sometimes without even realizing it themselves. A great analyst knows how bad results smell and how to avoid deluding themselves, and proves it by delivering business results. Unfortunately it's really hard to tell the difference until you've worked with someone for some time.

Expand full comment

If you rerandomized until none of your comparisons had p <0.05 for any difference in any of 100 possibly-relevant measurements, you would in some sense have an incredibly weird and unrepresentative instance of dividing the group into control and treatment groups. Maybe it would be okay, maybe it wouldn't, but I would certainly no longer be comfortable considering it a randomized study.

Expand full comment

> Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before.

Rerandomization is definitely a thing that gets discussed. See, for example, https://healthpolicy.usc.edu/evidence-base/rerandomization-what-is-it-and-why-should-you-use-it-for-random-assignment/

Expand full comment

Judea Pearl has written a book about using Bayesian statistics while including causality graphs. I still have it on my list to read again because much of it went over my head. However, it seems very applicable to these problems. I recall he claimed that you could run tests much more efficiently than with randomized trials. https://en.wikipedia.org/wiki/The_Book_of_Why

Expand full comment

> Multiply it all out and you end up with odds of 1900:1 in favor

That's not how that works. You were complaining that Bonferroni adjustments were too conservative, because they under-account for correlation, but now you did the opposite, where you assumed they were independent evidence.

Of course, understanding the correlation of multiple streams of evidence is HARD. And that's a large part of why people really like the 1-study, 1-hypothesis setup. It's the same kind of curse of dimensionality we see making lots of other things hard.

Informally, figuring out the correlation matrix of N variables requires far more data than figuring out pairwise correlations - so you're probably not actually better off with 10 studies looking at impact of 1 variable each than you are with 1 study ten times as large looking at all of them, but the size of the confidence intervals, and the costs of your statistical consultants figuring out what your data means, will be far higher in the second case. Keeping it simple fixes that - and punitive Bonferroni corrections are conservative, but for exactly that reason they keep people very, very honest with not overstating their findings.

Expand full comment

Maybe you could use the "Structural Equation Modeling" / "Confirmatory Factor Analysis" framework for your analysis. You are using the four indicators because you think they're influenced by one common factor, "authoritarianism". Naturally the four indicators are not 100% predetermined by "authoritarianism" and there is some error. The SEM approach allows you to extract the common factor with error correction. (I think of this approach as some kind of optimization problem: find the value for each datapoint which best predicts the four measured indicators.)

After that you could compute the correlation between your manifest measure of ambidexterity and the latent measure of authoritarianism (with only one comparison and corrected measurement error in your four indicators).

Expand full comment

>Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it? But it seems harsh to force them to perform a study even though we know we'll dismiss the results as soon as we get them. If we made them check a pre-written list of confounders and re-roll until there were no significant differences on any of them, what could go wrong? I don't have a good answer to this question, but thinking about it still creeps me out.

See https://en.wikipedia.org/wiki/Stratified_sampling

Stratify your population into subpopulations based on every variable which could conceivably correlate with ease of COVID recovery, then randomly split each subpopulation into a treat/do not treat group.
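A minimal sketch of that procedure, assuming every participant is known up front and is represented as a dict with an "id" key. The `stratum_of` callback and the field names are hypothetical; within each stratum, half the members are assigned to treatment at random, and an odd leftover member goes to a randomly chosen arm.

```python
import random
from collections import defaultdict


def stratified_assign(participants, stratum_of, seed=0):
    """Return {participant id: "treat" or "control"}, balanced within strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in participants:
        strata[stratum_of(person)].append(person)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        cut = len(members) // 2
        # For an odd-sized stratum, let chance decide which arm gets the extra person.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            cut += 1
        for position, person in enumerate(members):
            assignment[person["id"]] = "treat" if position < cut else "control"
    return assignment


if __name__ == "__main__":
    # Hypothetical cohort: 40 patients, half of them with high blood pressure.
    cohort = [{"id": i, "high_bp": i % 2 == 0, "over_60": i % 4 < 2} for i in range(40)]
    arms = stratified_assign(cohort, lambda p: (p["high_bp"], p["over_60"]))
    print(sum(1 for p in cohort if p["high_bp"] and arms[p["id"]] == "treat"))
```

The number of strata grows multiplicatively with each variable added, so with many variables or a small sample the strata quickly become too small to split, which is one reason pair matching is used instead.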

Another strategy I've seen is to pair every study participant with some other participant who looks very much like them in every way which seems to matter. Randomly give Vitamin D to one but not the other.

I guess these strategies might be a little impractical if you don't have access to the entire population at the start of the study (e.g. if a few new people are entering/exiting the ER for COVID every day, and you want to run the study over several months).

Expand full comment

> But with very naive multiple hypothesis testing, I have to divide my significance threshold by one hundred - to p = 0.0005 - and now all hundred experiments fail. By replicating a true result, I've made it into a false one!

This is really not the right interpretation of p-values.

"I try a hundred different ways to test it, and all one hundred come back p = 0.04." is not a thing. You would not get the same value on every test; you'd get a variety of answers, some that pass the 0.05 threshold and some that fail it. For a simple sample of a normal distribution the mean you measure will vary proportional to σ/sqrt(n). If your sample size isn't huge, a significant number of your measurements could be a whole standard deviation or more closer to the null hypothesis value -- definitely not a passing p-value at all.

In your case, getting 1 significant, 1 very not significant, 2 close (which imo is the better read of the situation) is clearly very possible, like, a distribution around a mean which falls either just above p=0.05 or just below. Clearly it's not very valid to just say "well two are significant therefore the result is significant" -- your very-failing test _disproves_ the same thing your passing tests prove!
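The scatter described here is easy to see in a small simulation. This is only a sketch under simplifying assumptions: each study is a two-sided z-test, the true effect is set so that the *expected* z-score corresponds to p = 0.04, and sampling noise adds a standard normal deviate to each study's z-score. All names are illustrative.

```python
import random
from statistics import NormalDist


def simulate_p_values(true_z: float, n_studies: int, seed: int = 0) -> list[float]:
    """Two-sided p-values from n_studies independent z-tests of the same true effect."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(true_z, 1.0)  # observed z = true z plus sampling noise
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values


# z-score whose two-sided p-value is exactly 0.04
true_z = NormalDist().inv_cdf(1.0 - 0.04 / 2.0)
p_values = simulate_p_values(true_z, 10_000)
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

if __name__ == "__main__":
    print(f"share with p < 0.05: {share_significant:.2f}")
    print(f"smallest p: {min(p_values):.2g}, largest p: {max(p_values):.2g}")
```

Under these assumptions only about half of the replications clear the 0.05 threshold, and individual p-values range from tiny to clearly non-significant, even though every study measures the same real effect.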

(That said, maybe the failing question was just not a good choice of signal. Nothing can be done in the statistics about that.)

Expand full comment

>In the comments, Ashley Yakeley asked whether I tested for multiple comparisons; Ian Crandell agreed, saying that I should divide my significance threshold by four, since I did four tests. If we start with the traditional significance threshold of 0.05, that would mean a new threshold of 0.0125, which result (2) barely squeaks past and everything else fails.

No, you did eight tests since it was two-tailed. The correct threshold is 0.00625. "Ambidextrous people more libertarian" and "ambidextrous people less libertarian" are two separate comparisons and you should correct for that. You should also pre-register (at least in your head) whether you're doing a one or two tailed test.

It's worth noting that you are allowed to divide the comparisons up unevenly when doing Bonferroni correction. Consider an experiment with two comparisons. You can say that comparison A has to be P<0.01 or comparison B has to be P<0.04, rather than setting both thresholds at 0.025.

IANAS but my intuition is that this sort of statistics is the researcher saying "The null hypothesis is false and I can prove it by specifying ahead of time where the results will end up." They can mark out any area of the possible result-space so long as the null hypothesis doesn't give it more than a 1/20 chance of the results landing in that area.

Expand full comment

If your outcomes are correlated, use the Westfall-Young correction instead of Bonferroni. Bonferroni is generally regarded as overly conservative.


Expand full comment

Now even more confusion on the Vitamin D front! News story today saying that there is now a report by a group of politicians in my country recommending people take Vitamin D supplements: https://www.rte.ie/news/2021/0407/1208274-vitamin-d-covid-19/

"The 28-page report, published this morning, was drawn-up by the cross-party Oireachtas Committee on Health in recent weeks as part of its ongoing review of the Covid-19 situation in Ireland.

It is based on the views of the Covit-D Consortium of doctors from Trinity College, St James's Hospital, the Royal College of Surgeons in Ireland, and Connolly Hospital Blanchardstown, who met the committee on 23 February.

The Department of Health and the National Public Health Emergency Team have previously cautioned that there is insufficient evidence to prove that Vitamin D offers protection against Covid-19.

The report says while Vitamin D is in no way a cure for Covid-19, there is increasing international evidence from Finland, France and Spain that high levels of the vitamin in people can help reduce the impact of Covid-19 infections and other illnesses."

So should you or shouldn't you take vitamin D? At this stage I honestly have no idea. Presumably you should if you're deficient, but too much will lead to calcium deposits on the walls of your blood vessels, according to a Cleveland cardiologist https://health.clevelandclinic.org/high-blood-pressure-dont-take-vitamin-d-for-it-video/. So how much is *too* much?

The report in question: https://www.rte.ie/documents/news/2021/04/2021-04-07-report-on-addressing-vitamin-d-deficiency-as-a-public-health-measure-in-ireland-en.pdf

Expand full comment

>Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it?

Better if they consider this *before* randomizing, and define quantitative criteria for what constitutes an acceptable randomization. If I roll 1d100 and, post facto, throw out results that "look too big" or "look too small", I've got a random distribution of middle-sized two-digit numbers biased by my intuitive assessment of "bigness" and "smallness". If I decide a priori to roll 1d100 and reject values below 15 or over 95, that's a truly random distribution over the range of 15-95.

With a small sample and many relevant confounding variables, it's likely that any single attempt at randomization is going to be biased one way or another on one of them. Fortunately, we've got computers and good random number generators; if you can define your "acceptable randomization" criteria mathematically, you can then reroll the dice a million times until you get something that fits those criteria - and without introducing biased human judgement into the assessment of any proposed randomization.

You've still got the potential for bias in the definition of your acceptability criteria, but that's an extension of the same bias you unavoidably introduced when you decided to look at e.g. blood pressure but not eye color. I don't think that becomes fundamentally worse if you add simple criteria like "for the variables we are looking at, an acceptable randomization must not deviate from the mean by more than one standard deviation".
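The "define the criteria first, then let the computer reroll" idea might look like the sketch below. The acceptance rule used here is an assumption chosen for illustration, not the one from the comment: for every pre-listed covariate, the two arms' means must differ by no more than `max_std_diff` standard deviations of that covariate. Function and parameter names are hypothetical.

```python
import random
from statistics import mean, pstdev


def rerandomize(covariates, max_std_diff=0.1, max_tries=100_000, seed=0):
    """Reshuffle until the pre-specified balance criterion holds.

    covariates: one dict per participant, mapping covariate name -> numeric value.
    Returns (treatment indices, control indices).
    """
    rng = random.Random(seed)
    n = len(covariates)
    names = list(covariates[0])
    sds = {name: pstdev([row[name] for row in covariates]) for name in names}

    def balanced(treat, control):
        for name in names:
            gap = abs(mean(covariates[i][name] for i in treat)
                      - mean(covariates[i][name] for i in control))
            if gap > max_std_diff * sds[name]:
                return False
        return True

    indices = list(range(n))
    for _ in range(max_tries):
        rng.shuffle(indices)
        treat, control = indices[: n // 2], indices[n // 2:]
        if balanced(treat, control):
            return sorted(treat), sorted(control)
    raise RuntimeError("no acceptable randomization found; loosen the criterion")


if __name__ == "__main__":
    data_rng = random.Random(1)
    patients = [{"age": data_rng.gauss(60, 12), "systolic_bp": data_rng.gauss(130, 15)}
                for _ in range(40)]
    treat, control = rerandomize(patients)
    print(len(treat), len(control))
```

Because the criterion is fixed before any shuffle is inspected and never looks at outcomes, no human judgement enters the accept/reject step. The analysis should still account for the restricted randomization, for example by using a permutation test over only the acceptable assignments.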

Expand full comment

I think the “sensitivity analysis” where they adjust for blood pressure is adequate. It’s not an observational study so we know that the differences in blood pressure aren’t caused by some third factor other than randomization. Also you don’t have to worry too much about multiple testing corrections because both results are significant since it’s less likely the null hypothesis is true, and agreement by chance is only an issue with null hypothesis assumptions.

Expand full comment

For your Bayesian example you could solve it with a hierarchical model but that’s very technically challenging.

Expand full comment

For the first one: reshuffling until you get well-balanced treatment and control is legit. You're not looking at the effect size, so you don't introduce multiple hypothesis testing issues. This is the basis for genetic matching https://ideas.repec.org/a/tpr/restat/v95y2013i3p932-945.html More generally, the solution to this problem is matching https://www.wikiwand.com/en/Matching_(statistics) It's not uncontroversial (as usual, Judea Pearl has raised concerns), but people use it.

Expand full comment

It feels like an important element missing in this discussion is that the problem of multiple testing arises from the fact that tests may be correlated. In the extreme, for example, if all libertarians are pro-immigration and all non-libertarians are against immigration, then testing the relation between ambidexterity and libertarianism will give you no additional information if you've already tested for the relation between ambidexterity and pro/anti immigration positions.

Of course, taking into account these correlations might be very tricky, so the easiest fix I can think of is to use Monte Carlo to build confidence intervals on some pre-determined set of variables, and use those to test whether your particular randomization passes the "statistically random" test.

Expand full comment

Regarding "Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; if your threshold for "significant" is p < 0.05, it'll be after investigating around 20 possible confounders (pretty close to the 15 these people actually investigated)."

Once the number of confounders is 14, there's a better than 50% chance that at least one of them will be a false positive (0.95 ^ 14 = 0.488).
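The arithmetic behind that figure, as a short sketch. It assumes the checks are independent and each has a 5% false-positive rate; the function name is made up.

```python
def chance_of_any_false_positive(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent null checks has p < alpha."""
    return 1.0 - (1.0 - alpha) ** k


# Smallest number of checks at which a false positive becomes more likely than not.
smallest_k = next(k for k in range(1, 1000) if chance_of_any_false_positive(k) > 0.5)

if __name__ == "__main__":
    print(smallest_k)
    for k in (14, 15, 20):
        print(k, round(chance_of_any_false_positive(k), 3))
```

By the same formula, 20 checks give roughly a 64% chance of at least one "significant" difference, so "around 20" is where one false positive is the expected count, not where it first becomes likely.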

Expand full comment

I think people on this thread might enjoy the paper 'Multiple Studies and Evidential Defeat' by Matt Kotzen: http://dx.doi.org/10.1111/j.1468-0068.2010.00824.x or https://sci-hub.do/10.1111/j.1468-0068.2010.00824.x. Quoting the first three paragraphs:

'You read a study in a reputable medical journal in which the researchers report a statistically significant correlation between peanut butter consumption and low cholesterol. Under ordinary circumstances, most of us would take this study to be at least some evidence that there is a real causal connection of some sort (hereafter: a real connection) between peanut butter consumption and low cholesterol.

Later, you discover that this particular study isn’t the only study that the researchers conducted investigating a possible real connection between peanut butter consumption and some health characteristic. In fact, their research is being funded by the Peanut Growers of America, whose explicit goal is to find a statistical correlation between peanut butter consumption and some beneficial health characteristic or other that they can highlight in their advertising. Though the researchers would never falsify data or do anything straightforwardly scientifically unethical, they have conducted one thousand studies attempting to find statistically significant correlations between peanut butter consumption and one thousand different beneficial health characteristics including low incidence of heart disease, stroke, skin cancer, broken bones, acne, gingivitis, chronic fatigue syndrome, carpal tunnel syndrome, poor self-esteem, and so on.

When you find out about the existence of these other studies, should that at least partially undermine your confidence that there is a real connection between peanut butter consumption and low cholesterol?'

The paper basically argues on Bayesian grounds that the answer is 'no'. The takeaway for me is that the whole 'correcting for multiple hypothesis' thing is just some crude system that journal editors need to cook up in order to keep from being swamped by uninteresting papers, and not something that people who are just interested in finding out the truth about some question need to pay any attention to.

Expand full comment

My answer to this - null hypothesis statistical significance tests are bad and should be abandoned - see this post from Andrew Gelman:


It's better to treat p-values almost like effect sizes, in that you develop an intuitive understanding of what different p-values mean. Then you apply Bayesian reasoning on whether the effect is plausible. My own personal intuition is something like this, when deciding whether to include effects in a multiple regression model:

p > 0.25 - Very little evidence. Include effect in model if there is very strong prior evidence. (very strong evidence might be stuff like "houses with more square footage sell for higher prices." The evidence should be nearly unassailable)

0.1 < p < 0.25 - Extremely weak evidence from data. Include effect in model if there is strong prior evidence. (I'd maybe say strong evidence would be something like "SSRIs have some benefit for anxiety and depression." There is maybe a small minority of evidence suggesting that SSRIs don't work, but overall there is a decently strong consensus that they do something).

0.05 < p < 0.1 - Weak evidence from data. Include effect in model if there is moderate prior evidence. (This should still be relatively settled science. Maybe the "SSRIs work" question would have had this level of evidence in 1995, when some data was available, but before we had tons of really good trials with lots of different SSRIs).

0.01 < p < 0.05 - Okay evidence from data. Should be some prior evidence, maybe a good causal explanation. (Maybe this would be the "SSRIs work" question in 1987 right after Prozac was approved. Some high quality data exists, but there aren't multiple studies looking at the problem from different angles).

0.001 < p < 0.01 - Moderate evidence from data. May sometimes accept result even in the absence of any good priors. I'd say your authoritarianism vs. ambidexterity result falls in this category, personally. The prior data isn't terribly convincing to me, so I'd want to see p-values in this range for me to consider this as a potentially real effect.

p < 0.001 - Strong evidence from data. Consider including effect even if weak/moderate prior evidence points in the opposite direction.

Expand full comment

I think there are a couple of things going on. First, you're often conflating effect size and significance. Just because something is significantly different between the groups does not mean it's a plausible alternative explanation (though it could be). It sounds like some other commenters have mentioned this. Also, if it is different, you could at least potentially do some sort of mediation analysis.

Second, you're not really thinking statistically. There's no certainty in any analysis; we're just comparing data to models and seeing how consistent or inconsistent they are. That gives us information but not conclusions, if that makes sense. More specifically, "correcting for multiple comparisons" is just ensuring that your false positive rate remains approximately 5% under certain assumptions. You can interpret p-values without that, and you still have to interpret them with that.

There might also be something similar to the gambler's fallacy going on here. Each individual p-value is independent (sort of), and you should expect about a 5% false positive rate, but if you anticipate doing 100 tests, you should anticipate a higher chance of getting *at least one* false positive.

A good case study of the statistical thinking issue is in the example you give. If you do one hundred tests and all of them come back p = 0.04, something has gone horribly, horribly wrong. The chance of all of them landing just under the threshold is incredibly small. In fact, many p-values just below .05 is often good evidence of p-hacking. I would interpret 100 cases of p = 0.04 as some sort of coding or computer error, or the strangest p-hacking I've ever seen. I know that's not the point of the thought experiment, but I think how you chose to construct the thought experiment is revealing of what you're misunderstanding here.

Expand full comment

The core of science is not significance testing but repeatability and replicability.

As you intuit, multicomparison testing trades false positives for false negatives. This article and, more importantly, the downloadable spreadsheet from Nature can help you explore that (https://www.nature.com/articles/nmeth.2900#MOESM343). Whether that’s “good” depends on what you’re trying to do. Want to declare something to be the ABSOLUTE TRUTH? Then suppress the false positives! Trying to figure out where to dive in deeper for your next study? Then do no corrections and explore the top statistical features in an intentional follow-up study. Intentionality is how we test for causality, which is what we’re really after in science.

I also highly advise understanding type M and type S errors, because overaggressive multicomparison testing leads to these at the meta-study level, causing systemic replication issues. Andrew Gelman discusses them at various points in his books and on his great blog. Gelman is a Bayesian, but this point isn’t inherently Bayesian, just more intellectually obvious to someone working in a Bayesian paradigm.


For the four measures of authoritarianism problem, I'd combine them into one. You could average them or do a weighted average in whatever way made sense to you as long as you did so before seeing how each measure correlated individually.
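As a rough sketch of what "combine them into one" could look like: standardize each measure, then take a (weighted) average per respondent. The data and function names below are made up for illustration, and the weights would have to be fixed before looking at any individual correlation.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one measure to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("a measure with no variation cannot be standardized")
    return [(v - mu) / sd for v in values]


def composite_score(measures, weights=None):
    """Weighted average of standardized measures, one score per respondent.

    measures: list of lists, each holding one value per respondent.
    weights:  chosen in advance; defaults to equal weights.
    """
    if weights is None:
        weights = [1.0] * len(measures)
    standardized = [z_scores(m) for m in measures]
    total_weight = sum(weights)
    n_respondents = len(measures[0])
    return [
        sum(w * z[i] for w, z in zip(weights, standardized)) / total_weight
        for i in range(n_respondents)
    ]


# Made-up answers from five respondents on two hypothetical measures:
# a 1-5 rating and a yes/no item coded 0/1.
measure_a = [1, 2, 3, 4, 5]
measure_b = [0, 0, 1, 1, 1]
scores = composite_score([measure_a, measure_b])
print(scores)
```

The single composite can then be tested once against ambidexterity at the usual threshold.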


Since your tests are likely highly positively correlated, you might want to use a resampling method, like the Westfall & Young minP test. But it can't be done from the summary stats in the post: you need the raw data and a bit of computing time. I see others here also mention this: it should be better known.

There are many surprising ways to get valid multiple-hypothesis rejection procedures that most people don't know about, but the key is you have to pick one ahead of time. Everyone uses Bonferroni so using Bonferroni doesn't raise any questions, but if you used "Scott's ad-hoc procedure" then everyone wonders whether it was cherry-picked. (A meta-multiple hypothesis testing correction might be needed?) Holm-Bonferroni is also a good test because it's just strictly better than Bonferroni, so really we should just use that as our default and it's not hard to defend. But in general, they all have weird edge-cases where you will be disappointed and surprised that it says "non-significant": that's inevitable.
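A minimal sketch of the comparison, with made-up p-values and hypothetical function names. Holm's step-down procedure sorts the p-values and compares the k-th smallest against alpha / (m - k + 1), stopping at the first failure, so it rejects everything Bonferroni rejects and sometimes more.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: walk the p-values from smallest to largest,
    loosening the threshold each step, and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


p_values = [0.010, 0.013, 0.020, 0.300]  # made-up example values
print(bonferroni_reject(p_values))
print(holm_reject(p_values))
```

With these particular numbers Bonferroni rejects only the first hypothesis, while Holm rejects the first three.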

But here's a simple alternative procedure. If you can order your hypotheses ahead of time (say, by your a priori expectation of how important they are), then start at the top of the list, reject every hypothesis with p < 0.05, and stop at the first one with p > 0.05. I.e. there's no Bonferroni-esque correction factor at all if you already have an order. If you have a good ordering of the importance, this might be the best you can do, but you had better pre-commit to this since it's easily game-able.
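The ordered procedure is short enough to sketch directly. The function name and p-values are invented for illustration; the only real requirement is that the order is fixed before seeing the data.

```python
def fixed_sequence_reject(ordered_p_values, alpha=0.05):
    """Test hypotheses in their pre-committed order at full alpha.
    Once one fails, it and everything after it goes unrejected."""
    reject = []
    stopped = False
    for p in ordered_p_values:
        if not stopped and p < alpha:
            reject.append(True)
        else:
            stopped = True
            reject.append(False)
    return reject


# Made-up p-values, listed in the order committed to in advance.
print(fixed_sequence_reject([0.03, 0.04, 0.20, 0.001]))
```

Note the cost of a bad ordering: the last hypothesis has p = 0.001 but is never rejected, because the procedure already stopped at the third.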

Also, a pet peeve: Bonferroni doesn't assume independence. Independence is a very particular and restrictive assumption which rarely holds in the cases where you want Bonferroni. Bonferroni works *always*. The hardest case is negatively-correlated p-values, and Bonferroni still works for that. If your results were truly independent, then you would use Fisher's combined p-value, not Bonferroni correction, and get a lot better power from it. There are corrections for Fisher's that supposedly take into account dependence (Brown's method) and that might be applicable here, but I don't entirely trust them (out of ignorance on my part of how they work).
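For reference, a small sketch of Fisher's method using only the standard library. Under independence, -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom, and for even degrees of freedom the tail probability has a closed form. The function names are hypothetical, and the independence assumption is doing all the work here.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df,
    using the closed form exp(-x/2) * sum_{i<df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Fisher's combined p-value; valid only for independent tests."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))


# Four made-up independent results, each sitting right at p = 0.05.
combined = fisher_combined_p([0.05, 0.05, 0.05, 0.05])
print(combined)
```

None of these four would survive a Bonferroni threshold of 0.05 / 4, yet the combined p-value comes out well below 0.05, which is the power gain being described.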


"I think the problem is that these corrections are for independent hypotheses, and I'm talking about testing the same hypothesis multiple ways (where each way adds some noise)." — No. The problem is your hypothesis/data ratio.

Given a fixed set of data, there are many more concepts that can be used to divide up the data than there is data itself. If you continually test new hypotheses without adding more data, then you will eventually stumble across a concept that happens to divide the data into "statistically significant" groups. But that doesn't mean the concept isn't an artificial distinction, like (got COVID)+(has golden retriever)! In fact, it very likely is.

The challenge here is that a hypothesis isn't just an English-language hypothesis. Every new mathematical specification of that English-language summary is effectively a new hypothesis, because it treats phenomena you _assume_ to be irrelevant differently. (See, e.g. Tal Yarkoni's "The Generalizability Crisis" (https://psyarxiv.com/jqw35).) This means that testing multiple mathematical specs of the same conceptual idea on the same data is almost certainly going to see a correlation arising from an artificial conglomeration of irrelevant factors, just by random chance.

To avoid this, each time you add a hypothesis, you need to either add more data or increase your skepticism. A Bonferroni-type correction is the latter option.


This is, indeed, a more difficult problem than most scientists are willing to admit.

For what it's worth, in my field (particle physics), we take a different approach:

We still talk about p-values and report experimental deviations from theory in terms of 𝜎-confidence levels (see recent "4.2𝜎 anomaly in muon g-2"). However, nobody has any idea how to quantify or correct for testing multiple hypotheses.

An example: in searches at the Large Hadron Collider, we search for particles of mass X and correct the statistical significance of some deviation in bin X due to the "look elsewhere effect" (LEE): what is the probability we would have seen *some* deviation at *some* mass bin? But we don't just measure mass/energy; in particle collisions we measure thousands of properties of (sometimes) dozens of outgoing particles. And these properties are basically all associated with *some* theory of New Physics. So... in a frequentist statistical sense, how many hypotheses are we testing? I think it's fair to say nobody has any idea.

Because of this, we run into stupid problems, like the fact that 99% of our 3𝜎 experimental signals actually are just statistical anomalies, not the other way around as naive p-values would suggest. Pop-science articles continue to say "this means there is only a 1 in a thousand chance of a statistical fluke!" and meanwhile smart particle physicists ignore such anomalies until more data comes in.

There's a growing contingent of physicists who want to go full Bayes, and maybe that's the right way to go. But the historical solution is more elegant: we set the threshold of "discovery" to 5𝜎, a threshold so obscenely high that we basically never claim discovery of a signal that goes away later.
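To make the 3𝜎 versus 5𝜎 contrast concrete, here is a back-of-the-envelope sketch. It assumes the one-sided Gaussian tail convention for converting 𝜎 to a p-value and treats the "looks" as independent; the count of 1000 looks is an arbitrary illustrative number, not a real trials factor.

```python
import math


def one_sided_p_from_sigma(n_sigma):
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def global_p(local_p, n_independent_looks):
    """Chance of at least one fluctuation this large somewhere,
    assuming independent looks (a crude look-elsewhere correction)."""
    return 1.0 - (1.0 - local_p) ** n_independent_looks


for n_sigma in (3, 5):
    local = one_sided_p_from_sigma(n_sigma)
    print(n_sigma, local, global_p(local, 1000))
```

Under these assumptions a 3𝜎 bump somewhere among a thousand looks is more likely than not, while a 5𝜎 bump remains rare even after the correction.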


The thing that's confusing you is that p-value hypothesis testing only cares about the risk of declaring the hypothesis when it isn't true. This is similar to a court, where they say "you must be 99% sure of guilt to declare guilt". So it's a weighted outcome.

So, from p-value hypothesis testing's point of view, if you do 100 tests (doesn't matter if they're the same or different -- just as long as the experiments themselves are independent), then that is just 100 opportunities to declare the alt-hypothesis when it isn't true, so your effective p-value quickly becomes "1" unless you rescale your individual p-values as you describe.

When you say that if you start viewing your career of paper-reading within the p-value-testing framework, you quickly find you can't accept anything -- that is exactly the case! Look at it this way: if you read 1 paper with a positive result, and I ask you "do you think you've read a false positive?", you'll say "well, hopefully not; if they've done everything right, then with probability p this is a false positive". But if you read 1000 papers, then you have to say "yes, I'm sure there was a false positive in there".

The issue is you're misusing "p" as an uncertainty or accuracy parameter, when actually "p" is very specifically the probability of reading a false positive. And this is what's confusing you when interpreting what happens when you combine results (the p-value rises, when you think it should fall).

If you want to use p-value stuff for this question, you want to combine all information into a single test and set the p-value of that appropriately. But if I were you I'd just find a nice way to graph it and have done with it.


"By replicating a true result, I've made it into a false one!"

This can happen easily when using hypothesis testing, as a test is allowed to be arbitrarily bad as long as it doesn't reject the null hypothesis too often when the null hypothesis is true. For example, the following is a valid test at a significance level of 0.05: I use a test at a level of 0.01, but 4% of the time I randomly reject H0 anyway. In addition, different tests might work well in different situations, but you are allowed to pick only one, in advance.
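A quick check of that deliberately bad test, as a sketch. It assumes the p-value is uniform on [0, 1] under H0 and that the random rejection is independent of the data; the function names are made up.

```python
import random


def sloppy_test_rejects(p_value, rng):
    """Reject if p < 0.01, or otherwise on a 4% coin flip."""
    return p_value < 0.01 or rng.random() < 0.04


def simulated_false_positive_rate(n_trials=200_000, seed=0):
    """Draw p-values under H0 (uniform) and count rejections."""
    rng = random.Random(seed)
    rejections = sum(
        sloppy_test_rejects(rng.random(), rng) for _ in range(n_trials)
    )
    return rejections / n_trials


# Exact rate: reject at 0.01, plus a 4% coin flip on the remaining 99%.
exact_rate = 0.01 + (1 - 0.01) * 0.04
print(exact_rate, simulated_false_positive_rate())
```

The false positive rate stays just under 0.05, so the test is valid, even though most of its rejections carry no information about the data at all.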

Dividing by the number of comparisons works best if the comparisons are independent. If you are testing the exact same comparison multiple times, this is not true, so it will not be a very good test, in the sense that it will accept H0 when other, higher-powered tests would not. But that is inherent in hypothesis testing: rejecting H0 tells you something, while accepting it tells you very little without more context (sometimes, if you have a lot of data and you know you used a good test, you can say: look, if there really was any meaningful effect I would have been able to reject my null hypothesis by now...).


Correct me if I'm wrong, but doesn't multiple hypothesis testing have more problems when you start looking at composite outcomes?

For example, if you are looking at an apixaban study where the primary outcome is any major cardiovascular event, death, stroke, or major bleeding, you are testing multiple hypotheses at the same time, and therefore you need to correct the significance level for that. Basically the correction is to make sure that if you are testing more than one hypothesis in the same test then you adjust significance to match.

If you are looking at a number of different hypotheses independently you might credibly be accused of p hacking (looking for any significant results in the data no matter what you originally set out to do), but I don't think that it is standard practice to adjust significance levels based on the number of secondary outcomes.


Simplest solution: when you get such a big difference between treatment and control group, split your study into two substudies: efficacy of vitamin D in hypertensive and normotensive patients. If you get statistically significant results in one or both substudies, make a bigger study.

Another thing that would be useful would be putting more emphasis on a large number of small exploratory studies. It's better to do ten n=10 studies and then one n=100 study for the treatment that had the biggest effect size than an n=200 study that leaves us confused about what the deciding factor was.


I'm pitching a bit outside of my league here, but could you run a PCA on the authoritarian questionnaire? As far as I understand it, you could use a single 'authoritarian' metric and correlate that with ambidexterity, keeping your alpha at 0.05. This of course assumes that 4 questions would produce a meaningful metric and that the questions capture the authoritarian construct, but that's assumed anyways.


I have never understood why in cases like this they test for significant group differences. Even under a frequentist paradigm that does not make any sense. We don't want to be reasonably sure that the groups differ but reasonably sure that they do not differ.

From a frequentist viewpoint the correct way would be to employ equivalence tests, as in https://journals.sagepub.com/doi/full/10.1177/2515245918770963

But if that approach were used, in many cases it would show the groups not to be equivalent (unless you had a really large sample size).


TLDR: if you don't know what mean shift in 1-5 rating corresponds to what odds ratio, then it's impossible to integrate the two effects.

The main issue with doing a full Bayesian analysis (or a frequentist analysis focusing on effect size with uncertainty) is that you have two different effect scales: one is an odds ratio and the other is a shift in an ordered predictor (that you treat as metric :P). If they were all on the same scale, then it would be trivial to do a Bayesian update on the full distributions using grid approximation. If you have some theory of how to convert between these scales, then you can transform to that space and then do the Bayesian update via grid approximation.

Also, if your hypothesis is that ambidextrous people are less mainstream, then you should test for a difference in sigma instead of a difference in mu :)


In a Bayesian analysis you can never just multiply odds factors together from experiments unless they are completely independent, which they almost never are. So the right calculation for the above would be:

1. 19:1 in favor

2. Odds of 2 being true conditional on 1

3. Odds of 3 being true conditional on 1 and 2

4. Odds of 4 being true conditional on 1 and 2 and 3

Of course, figuring out what the conditional probabilities are is hard and requires some underlying world model that establishes the relationships between those factors.
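Written out, the four steps above are the chain rule applied to the odds. The notation is my own shorthand rather than anything from the post: H is the hypothesis and E_1 through E_4 are the four results.

```latex
\frac{P(H \mid E_1, E_2, E_3, E_4)}{P(\neg H \mid E_1, E_2, E_3, E_4)}
  = \frac{P(H)}{P(\neg H)}
    \cdot \frac{P(E_1 \mid H)}{P(E_1 \mid \neg H)}
    \cdot \frac{P(E_2 \mid E_1, H)}{P(E_2 \mid E_1, \neg H)}
    \cdot \frac{P(E_3 \mid E_1, E_2, H)}{P(E_3 \mid E_1, E_2, \neg H)}
    \cdot \frac{P(E_4 \mid E_1, E_2, E_3, H)}{P(E_4 \mid E_1, E_2, E_3, \neg H)}
```

Each conditional factor collapses to the plain ratio P(E_i | H) / P(E_i | not H) only when the results are independent given H and given not H, which is the case where simply multiplying the individual odds factors would be justified.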


The good news is that randomization isn't about balancing the samples relative to baseline characteristics anyway:


So this sort of "search for things that are different after randomization and call them confounders" isn't a problem that actually needs addressing.

Well, I guess it would be nice if people stopped looking for differences at baseline in randomized studies. *That's* a problem that needs addressing. And would save a lot of paper and / or electrons.


I commented above on the part 1, on which (I now notice) several commenters have either similar or more enlightened things to say.

So about point (2). The Bayes factors have already been discussed. Commenter Tolkhoff mentioned hierarchical models. As an exercise and as a student of statistics, here is how I'd begin an analysis, if I were to do state of the art Bayesian inference and try to fit a Bayesian hierarchical (regression) model.

We have four measured (binary and ordinal) outcomes; let us label them A, B, C, D. They are assumed to noisily measure a common underlying factor related to the level of political ideology, Y, which is not directly observable. One wishes to use them to tell how much they (or actually, Y) are related to ambidexterity, X, a binary variable that is known. To simplify the presentation a little bit, I pretend the 1-5 ordinals are also binary outcome variables by categorizing them (there are better ways, but this is an illustration).

In other words, we hypothesize there is some variable associated with ambidexterity-relevant-authoritarianism, Y, which takes some value if a person is ambidextrous (X=1) and some other value if not (X=0), with some noise. I write Y = bX + e, with e ~ N(0, sigma_y). Parameter b is the effect I wish to infer; sigma_y relates to the magnitude of variation.

We further hypothesize people are more likely to answer positively to question A if they have a high level of Y. More precisely, say, they answer positively to question A with probability p_Y, which somehow depends on Y. Statistically, observing a number of positive events n_A out of n_samples trials follows a binomial distribution. Writing this in terms of distributions, n_A ~ Binomial(p_Y, n_samples). So what about probability p_Y? The classical way to specify the "it somehow depends on Y" part is to model p_Y = sigmoid(Y). Similarly for B, C, and D. After specifying some reasonable priors for effect b and sigma_y, one could try to fit this model and then look at the inferred posterior distribution of b (and sigma_y) to read the Bayesian tea leaves to say something about the effect relative to noise.
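To make the generative story concrete, here is a forward simulation sketch of it (not a fit). It assumes Y is drawn separately for each person, and the values b = 1.0 and sigma_y = 1.0 are arbitrary illustrative choices, not estimates of anything.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def simulate_positive_rate(x, b, sigma_y, n_samples, rng):
    """Fraction of simulated people answering a question positively.

    For each person: Y = b * x + e with e ~ N(0, sigma_y), then a
    positive answer with probability sigmoid(Y).
    """
    positives = 0
    for _ in range(n_samples):
        y = b * x + rng.gauss(0.0, sigma_y)
        if rng.random() < sigmoid(y):
            positives += 1
    return positives / n_samples


rng = random.Random(0)
rate_not_ambidextrous = simulate_positive_rate(0, 1.0, 1.0, 20_000, rng)
rate_ambidextrous = simulate_positive_rate(1, 1.0, 1.0, 20_000, rng)
print(rate_not_ambidextrous, rate_ambidextrous)
```

Fitting the model runs this logic in reverse: given the observed counts, infer which values of b and sigma_y make them plausible. That step would normally be done with a probabilistic programming tool rather than by hand.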

One should notice that in this model, with shared parameter p_Y, I have assumed that all differences between number of positive answers to different questions would be due to random variation only. All the differences between how well the different questions capture the "authoritarianism as it relates to ambidexterity" factor would be eaten by the noise term e. More likely, one would like to fit a model that incorporates a separate effect how each question is related to the purported authoritarianism variable. I don't know if there is any canonical way to go about this, but one simple variation to initial model would be to replace b with separate effects b_A, b_B, ... (and thus separate probabilities p_Y,A, p_Y,B) for each question A, B, ... , where b_A, b_B, ... share common prior about common effect b (maybe writing b_A ~ b + N(0, sigma_ba), and some hyperprior for sigma_bas).

Sounds complicated? The hierarchical logic is less complicated than it sounds, but obtaining a model that fits and interpreting it can require some work. (The model I drafted is not necessarily the best one.) The positive part here is that there is no pressing need to think about P-values or multiple hypothesis corrections. One obtains a posterior distribution for the common effect b and separate parts b_A, b_B, ..., which provides some level of support for or against each hypothesis, and the fitted model provides a predictive distribution about future data. If one adds more questions (hypotheses) E, F, G, they either provide information about the common effect (affecting the shared effect b) or not (affecting the variance estimates). Or, so the theory says.


The difference is post-hoc analysis versus prediction. There are an infinite number of potential factors we could stratify against, making post-hoc statistical hypothesis testing an arbitrary artifact of the number of factors we choose to look at. The best this kind of thing can do is give us an observation to make a hypothesis based off of. We'd need to do another study to test that hypothesis generated by the post-hoc analysis.

In contrast, the researchers who said, "We think it's Vitamin D" had to pick the specific factor they thought would be significant before the experiment took place. The entire difference is whether you're able to predict the specific factor that will be different, versus being able to pick enough factors that one will be different by chance alone.

This is why you often hear "early study produces statistically strong p-value of unexpected finding", only to be followed up by a later study that shows no significant difference. The first study didn't show anything; it generated a hypothesis that hadn't been tested yet. It was the predictive study that actually tested the hypothesis - and found it was statistical noise.

Take this to your multiple hypothesis test of ambidextrous people. When you try to test the same hypothesis different ways, you're doing serial prediction-based tests of the same hypothesis - NOT doing multiple hypothesis testing.

(As you pointed out each different test comes with its own sub-hypothesis, which may be incorrect. For example, as someone with no handed preference I thought the Libertarian test was particularly poor for testing cognitive closure. I remember answering the question on the survey and not selecting Libertarian, because of the strong ideological opposition to state intervention. For me, all strong ideological positions feel like trying to force-fit the observational square peg into the ideologically round hole.)

So you can ask whether each test actually tested the hypothesis, or whether they tested a new hypothesis (e.g. 'Libertarian is a proxy for cognitive closure' - probably not), but otherwise you shouldn't treat serial testing of the same hypothesis as multiple hypothesis testing. The key is to ask whether you're making a prediction before the beginning of the experiment or not.



On rerandomization, see here:


"We show that rerandomization creates a trade-off between subjective performance and robust performance guarantees. However, robust performance guarantees diminish very slowly with the number of rerandomizations. This suggests that moderate levels of rerandomization usefully expand the set of acceptable compromises between subjective performance and robustness."


The goal of randomization is ONLY to ensure the two groups are as similar as possible at the start of your experiment. There’s nothing statistically fancy about it. We use randomization because it’s easy, and because it should cover potential confounding factors we didn’t think of, as well as the ones we did.

So there would be nothing wrong with re-randomizing until you got any obvious confounding factors pretty even, as long as you do it before starting the experimental procedure.

Or you could pair your participants up on the few variables you expect to be confounding, and randomly assign them to your groups. I.e., take 2 women between 45 and 50 years old, both with healthy blood pressure and no diabetes, from your sample. Then flip a coin to see which goes into which group. Then take 2 men between 70 and 75 years old, both with moderately high blood pressure but no diabetes, flip a coin ....
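The pairing idea can be sketched in a few lines. Everything here is illustrative: the participant records are made up, the function name is hypothetical, and exact matching on a coarse key stands in for whatever matching rule a real trial would use.

```python
import random
from collections import defaultdict


def matched_pair_assignment(participants, key, rng):
    """Group participants by key, pair them up within each group,
    and flip a coin to decide which member of a pair gets treatment."""
    strata = defaultdict(list)
    for person in participants:
        strata[key(person)].append(person)
    treatment, control, unpaired = [], [], []
    for members in strata.values():
        rng.shuffle(members)
        while len(members) >= 2:
            a, b = members.pop(), members.pop()
            if rng.random() < 0.5:
                a, b = b, a
            treatment.append(a)
            control.append(b)
        unpaired.extend(members)  # leftovers with no match
    return treatment, control, unpaired


def match_key(person):
    return (person["sex"], person["age_band"], person["hypertensive"])


participants = [  # made-up records
    {"id": 1, "sex": "F", "age_band": "45-50", "hypertensive": False},
    {"id": 2, "sex": "F", "age_band": "45-50", "hypertensive": False},
    {"id": 3, "sex": "M", "age_band": "70-75", "hypertensive": True},
    {"id": 4, "sex": "M", "age_band": "70-75", "hypertensive": True},
    {"id": 5, "sex": "F", "age_band": "70-75", "hypertensive": False},
]
treatment, control, unpaired = matched_pair_assignment(
    participants, match_key, random.Random(0)
)
print(treatment, control, unpaired)
```

By construction the two groups end up with identical distributions of the matching variables, while the coin flip keeps the assignment within each pair random.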


About problem II, as you write you're testing the same hypothesis in a few different ways, not many disparate hypotheses. In this case, it's best to perform a test of joint significance, like an F-test. This is a single test which takes into account all the effects in the different outcome variables together to figure out if the difference between groups is significant.


"Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before." This is called dynamic randomization. Assignment depends on the current subject's strata AND all the strata before. If I recall, it is hard to do a re-randomization test, and you cannot always guarantee the large sample properties of your test statistics.



About re-rolling the randomization in case of a bad assignment of people to groups, in addition to Stratified Sampling that others have pointed out, I’d like to point out that re-rolling is called Rejection Sampling, and it is a useful thing to do.

Suppose you want to sample uniformly from some set S (say, possible assignments of people into equally-sized groups with no significant difference in blood pressure), but you don’t have an easy way to sample from S. If S is a subset of a larger set T (say, possible assignments of people into equally-sized groups), and you have a way to test whether an element of T belongs to S, and you do have a way to uniformly sample from T, then you can turn this into a way to uniformly sample from S, using the re-rolling method you describe: draw an element of T, and if it’s not in S, try again. This is called Rejection Sampling.
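Here is a small sketch of that re-rolling loop. The blood pressure values, the 2-point balance threshold, and the function name are all made up for illustration; T is "all equal splits" and S is "splits whose mean blood pressure differs by at most the threshold".

```python
import random
from statistics import mean


def rerandomize(values, max_gap, rng, max_tries=10_000):
    """Draw random equal splits until the group means differ by at
    most max_gap; returns the two groups as sorted index lists."""
    n = len(values)
    indices = list(range(n))
    for _ in range(max_tries):
        rng.shuffle(indices)  # uniform draw from T
        group_a, group_b = indices[: n // 2], indices[n // 2:]
        gap = abs(
            mean(values[i] for i in group_a)
            - mean(values[i] for i in group_b)
        )
        if gap <= max_gap:  # membership test for S
            return sorted(group_a), sorted(group_b)
    raise RuntimeError("no acceptable split found; loosen max_gap")


# Made-up systolic blood pressure readings for 20 participants.
blood_pressure = [110 + 3 * i for i in range(20)]
group_a, group_b = rerandomize(blood_pressure, 2.0, random.Random(0))
print(group_a, group_b)
```

Because every accepted split is an unmodified uniform draw from T, the result is uniform over S. The balance criterion does have to be fixed before the first draw, for the same pre-commitment reasons discussed elsewhere in this thread.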


“But this raises a bigger issue - every randomized trial will have this problem. Or, at least, it will if the investigators are careful and check many confounders. [...] I don't think there's a formal statistical answer for this.”

I don’t think this is quite right. Treatment effect estimation in control trials doesn’t necessarily require randomisation, and either way one can balance groups on characteristics observed before the trial. See e.g. Kasy (2016): https://scholar.harvard.edu/files/kasy/files/experimentaldesign.pdf


Agree, well said, been saying similar for years. Here the 'MHT correction' *helps them*.

But researchers shouldn't give themselves the benefit of the doubt when dealing with diagnostic testing nor considering the possibility of a problem with the design that could have a massive impact on the accuracy of the study.

We don't need to 'strongly reject the null' when we smell smoke, before we are worried that there is a fire in our laboratory.

We really need an approach that puts boundaries on the *extent of the possible bias* and then the researcher needs to demonstrate that the 'extreme probability bounds' on such a bias are still fairly small.

I started to discuss this in my general notes [here](https://daaronr.github.io/metrics_discussion/robust-diag.html) ... hope to elaborate


Your ambidexterity/authoritarianism example sounds like a situation where principal component analysis (PCA) could be helpful. Rather than using four different questions that reflect a single underlying, latent concept, you use PCA to tease out a better single measure of the target latent concept. This is used for economic predictions, when many different indicators are thought to give information about the overall state of the economy, but also in political science where the same logic applies.
