273 Comments
Comment deleted

I have no idea what the actual scientists would say, but the mechanism of "Vitamin D regulates calcium absorption, and calcium is really important for the activation of the immune system, and the immune system fights off viral infections" seems like a decent guess! At least strong enough to make testing worthwhile and, given enormous D-deficiency rates in many of the tested populations, a reasonable candidate for hypothesis testing.


Shouldn't "19:1 in factor" be actually "19:1 in favor"?

author

Thanks, fixed.


Bayesian statistics require you to compare the chance of the outcome given the null hypothesis vs. the chance of the outcome given h1.

So the chance of getting p=0.5 given the null hypothesis is very high, but very low given h1, so it should significantly update you towards h0.


The correct way to do it from a Bayesian perspective is to compare the probability density of the exact outcome you got given h0 vs. given h1, and update according to their ratio.

In practice the intuition is that if you were 100 % sure you would get around this result given h1, then a p = 0.05 would indeed make you 20 times as certain in h1. If you expected to get a bit higher or a bit lower, it would make you less than 20 times as certain (or even reduce your certainty). If you expected to get pretty much exactly that result, it could make you thousands of times more certain.

TLDR: you have to look at both p given h0, and p given h1 to do Bayesian analysis


Right. Except that p=0.05 means 5% chance of this outcome or an outcome more extreme, so if we are talking about densities as you say and we were sure we would get this *exact* outcome in h1 world and we got it, that should make us update way more than 20 times.


Isn't that what I wrote?

> If you expected to get pretty much exactly that result, it could make you thousands of times more certain.


Yeah, I misread. But I still don't get what you meant by "if you were 100 % sure you would get around this result given h1, then a p = 0.05 would indeed make you 20 times as certain in h1.". Are you just describing a common wrong intuition, or claiming that the intuition makes sense if you were sure your result would be "around" in h1? If so, why does that make sense?


Yeah I wasn't very precise there. What I think I meant is that for some distributions of h1 it would be around 20 times higher. I think it's too late to edit my post, but if I could I would delete that.


Maybe, if in h1 it was around 100% sure that one would get a result equal or more extreme than the actual result, then one would get that 20 factor, right?

This clearly wouldn't be using ALL the information one has. It feels weird though to choose what subset of the information to use and which to ignore when updating depending on the result.

I need to study this.


^ this. your test 3 should not give you a 1:1 Bayes factor


because the libertarian result is pretty likely if the authoritarianism thesis is false (that's where the p-value is coming from), but unlikely if the authoritarianism thesis is true. That's why you can't do any of this with just p-values.


The prior probability of h0 is 0 (no pair of interesting variables in a complex system has zero association), so in a Bayesian analysis no data should update you in favor of (or against) it.


You have to look at probability densities not probabilities


I don't get that--probabilities are the thing that has real world interpretation we care about. In a Bayesian analysis it just makes more sense to think in terms of distributions of effect sizes than null hypothesis testing.


True Bayesian analysis is impossible. It requires enumerating the set of all possible universes which are consistent with the evidence you see, and calculating what percentage have the feature X you're interested in. This is an approximation of Bayesian analysis which updates you towards either h0 (these variables are not strongly linked) or h1 (these variables are strongly linked).


"All possible universes" is just the 4D hypercube representing the possible true differences in group means for the four questions considered. Given a prior on that space it is straightforward to calculate the posterior from the data. The only tricky part is figuring out how to best represent your prior.


Yes, Bayesian techniques are exactly meant to deal with issues like this.


The key is to have a large enough study such that if there are 20 potentially relevant factors, even though one of them will probably show a significant difference between groups, that difference will be too small to explain any difference in results. Here the study was tiny, so one group had 25% high blood pressure and the other over 50%.


Yeah, as a reminder, the fluctuations are inversely proportional to the square root of the number of random processes.


This is a very good point. However it's not always possible to just "run a bigger study". (In this case, sure, that's probably the right answer). Is it possible to do better with the small sample-size that we have here?

For example, I'm wondering if you can define a constraint on your randomized sampling process that prevents uneven distributions -- "reroll if it looks too biased" in Scott's phrasing, or possibly reweighting the probabilities of drawing each participant based on the population that's been drawn already. Essentially you want each sampled group to have a distribution in each dimension that matches the sampled population, and that matches the other group. I'm sure this is something that's common in clinical trial design, and I'll try to dig up some citations for how this is done in the industry.

For example see "quota sampling" and "stratified sampling" here (quick google search so no validation on quality): https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1a-epidemiology/methods-of-sampling-population, which seems relevant, but doesn't go deep enough to analyze when you'd need to use these techniques.
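
A minimal sketch of the "reroll if it looks too biased" idea in Python. The covariate values, the 0.15-SD imbalance threshold, and the function name are all made up for illustration; stratification or minimisation (discussed elsewhere in this thread) are the more standard tools, but this shows the basic loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def reroll_randomize(covariates, max_imbalance=0.15, max_tries=10_000):
    """Randomly split subjects in half, rerolling until every covariate's
    group means differ by less than max_imbalance standard deviations."""
    n = len(covariates)
    sd = covariates.std(axis=0) + 1e-12               # guard against constant columns
    for _ in range(max_tries):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, n // 2, replace=False)] = True
        gap = np.abs(covariates[mask].mean(axis=0)
                     - covariates[~mask].mean(axis=0)) / sd
        if gap.max() < max_imbalance:                 # accept only well-balanced splits
            return mask                               # True = treatment, False = control
    raise RuntimeError("no acceptably balanced split found; relax the threshold")

# toy example: 76 subjects, columns = age, systolic BP, baseline vitamin D
X = rng.normal(loc=[60, 135, 20], scale=[12, 18, 8], size=(76, 3))
treated = reroll_randomize(X)
print(treated.sum(), "treated /", (~treated).sum(), "control")
```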


What you're referring to is called blocking. At the most extreme you pair each person to one other person in such a way that the expected imbalance after randomization is minimized.


Good career advice: never work with children, animals, or small sample sizes.


Just to clarify, since I'm not totally sure I get the argument, is the idea here that as sample size increases, you are less likely to end up with a large difference in the groups (e.g., in this case, you are less likely to end up with wildly different proportions of people with high blood pressure)?

I think it's worth noting that testing for statistically significant differences in covariates is nonsensical. If you are randomizing appropriately, you know that the null hypothesis is true - the groups differ only because of random chance. If you end up with a potentially important difference between the groups, it doesn't really matter if it's statistically significant or not. Rather, it matters to whatever extent the difference influences the outcome of interest.

In this case, if blood pressure is strongly related to covid recovery, even with a large sample size and a small difference between the groups, it would be a good idea to adjust for blood pressure, simply because you know it matters and you know the groups differ with respect to it.

Note that, while it's true that larger sample sizes will be less likely to produce dramatic differences in covariates between randomized treatment and control groups, larger sample sizes will also increase the chance that even small differences in relevant covariates will be statistically significantly related to differences in the outcome of interest, by virtue of increasing the power to detect even small effects.


On the topic of your point in part I, this seems like a place where stratified sampling would help. Essentially, you run tests on all of your patients, and then perform your random sampling in such a way that the distribution of test results for each of the two subgroups is the same. This becomes somewhat more difficult to do the more different types of tests you run, but it shouldn't be an insurmountable barrier.

It's worth mentioning that I've seen stratified sampling being used in some ML applications for exactly this reason - you want to make sure your training and test sets share the same characteristics.
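
For reference, here is what that looks like with scikit-learn's stratified splitting; the data and the roughly 20% positive rate are invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # hypothetical features
y = (rng.random(200) < 0.2).astype(int)       # imbalanced binary labels, ~20% positives

# stratify=y keeps the positive rate (nearly) identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())               # both close to 0.20
```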


I was going to propose something like this, unaware it already had a name. I imagine it's harder to stratify perfectly when you have 14 conditions and 76 people, but it would be the way to go.

If someone hasn't already, someone should write a stratification web tool that doctors or other non-statisticians can just plug in "Here's a spreadsheet of attributes, make two groups as similar as possible". It should just be something you are expected to do, like doing a power calculation before setting up your experiment.


This is a thing. It's called minimisation or adaptive randomisation. It works, and should be recommended more often for small studies like this.
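
A rough sketch of what minimisation looks like in code, assuming two binary prognostic factors and a 0.8 biased-coin probability; both choices are illustrative rather than taken from any particular guideline.

```python
import random

random.seed(0)

def minimised_assignment(new_factors, assigned, p_best=0.8):
    """Assign one incoming subject to 'treatment' or 'control'.

    new_factors: the subject's factor levels, e.g. {"high_bp": 1, "male": 0}
    assigned:    list of (factors, arm) pairs for everyone already randomised.
    The arm that keeps the factor-level counts most balanced is chosen with
    probability p_best; the biased coin keeps assignment unpredictable."""
    def imbalance_if(arm):
        score = 0
        for factor, level in new_factors.items():
            counts = {"treatment": 0, "control": 0}
            for factors, a in assigned:
                if factors.get(factor) == level:
                    counts[a] += 1
            counts[arm] += 1                  # pretend the new subject joins this arm
            score += abs(counts["treatment"] - counts["control"])
        return score

    better = min(("treatment", "control"), key=imbalance_if)
    worse = "control" if better == "treatment" else "treatment"
    return better if random.random() < p_best else worse

# rolling recruitment: subjects arrive one at a time
assigned = []
for subject in [{"high_bp": 1, "male": 1}, {"high_bp": 1, "male": 0},
                {"high_bp": 0, "male": 1}, {"high_bp": 1, "male": 1}]:
    arm = minimised_assignment(subject, assigned)
    assigned.append((subject, arm))
    print(subject, "->", arm)
```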


I am a couple of days late, but yes, this is what I would suggest, with caveat I don't have much formal studies in experimental design. I believe in addition to "stratified sampling", another search key word one wants is "blocking".

Intuitively, when considering causality, it makes sense. One hypothesizes that two things, confounding factor A and intervention B, could both have an effect, and one is interested in the effect of the intervention B. To be on the safe side, one wants the groups to be balanced on A by design rather than leaving that entirely to chance.

The intuitive reason here is that randomization is not magic, but serves a purpose: reducing statistical bias from unknown confounding factors.

Abstracted lesson I have learned from several practical mistakes over the years: Suppose I have 1000 data items, say, math exam scores of students from all schools in a city. As far as I am concerned, I do not know if there is any structure in the data, so I treat it as random and split the sample into two groups of 500: the first 500 in one and the rest in the other. What if I am later told that the person who collected the data reported them in the spatial order of the school districts as they parsed the records, forgot to include the district information, but now tells me the first 500 students include pupils from the affluent district with the school that has a nationally renowned elite math program? An unknown factor has now become a known factor, while the data and sample order remained the same! Doing the randomization myself is the way I could have avoided this kind of bias. But if the person tells me the districts of the students beforehand, I can actually take this into account while designing my statistical model.


Were those really the only questions you considered for the ambidexterity thing? I had assumed, based on the publication date, that it was cherry-picked to hell and back.

author

No, it was completely legit.


Guess I should reread it taking it seriously this time. (Not the first time I've had to do that with something published Apr 1)


Please forgive me for remaining a teeny bit suspicious.


There's good reason to be suspicious of the result (as I'm sure Scott would agree - one blogger doing one test with one attempted multiple-hypothesis test correction just isn't that strong evidence, especially when it points the opposite direction of the paper he was trying to replicate).

I think there's no reason to suspect that Scott was actually doing an April Fools' joke - if it were one, it wouldn't have been a good one, and Scott tends to do good twist endings when he has a twist in mind.


Agreed, though that post would have benefited a lot from being posted on another day (or have some kind of not-an-april-fools-joke disclaimer).


I agree with you. I was asking for forgiveness because despite there being no reason for suspicion, I found that I was, indeed, still suspicious.


One comment here: if you have a hundred different (uncorrelated) versions of the hypothesis, it would be *hella weird* if they all came back around p=.05. Just by random chance, if you'd expect any individual one of them to come back at p=0.05, then if you run 100 of them you'd expect one of them to be unusually lucky and come back at p=.0005 (and another one to be unusually unlucky and end up at p=0.5 or something). Actually getting 100 independent results at p=.05 is too unlikely to be plausible.

Of course, IRL you don't expect these to be independent - you expect both the underlying thing you're trying to predict and the error sources to be correlated across them. This is where it gets messy - you sort of have to guess how correlated these are across your different metrics (e.g. high blood pressure would probably be highly correlated with cholesterol or something, but only weakly correlated with owning golden retrievers). And that in itself is kind of a judgement call, which introduces another error source.


You can empirically determine the correlation between those different metrics (e.g., R^2) and correct the multiple hypothesis test statistic.


What I generally use and see used for multiple testing correction is the Benjamini-Hochberg method - my understanding is that essentially it looks at all the p-values, and for each threshold X it compares how many p-values < X you got vs. how many you'd expect by chance, and adjusts them up depending on how "by chance" they look based on that. In particular, in your "99 0.04 results and one 0.06" example it only adjusts the 0.04s to 0.040404.

But generally speaking you're right, all those methods are designed for independent tests, and if you're testing the same thing on the same data, you have no good way of estimating how dependent one result is on another so you're sort of screwed unless you're all right with over-correcting massively. (Really what you're supposed to do is pick the "best" method by applying however many methods you want to one dataset and picking your favorite result, and then replicate that using just that method on a new dataset. Or split your data in half to start with. But with N<100 it's hard to do that kind of thing.)
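
For concreteness, a minimal sketch of the Benjamini-Hochberg step-up rule applied to the "99 results at p=0.04 and one at p=0.06" example discussed above; the function name and the FDR level are illustrative.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array: which hypotheses are rejected at FDR <= alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m    # i/m * alpha for the i-th smallest p
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()             # largest i with p_(i) <= i/m * alpha
        rejected[order[:k + 1]] = True              # reject everything up to that rank
    return rejected

pvals = [0.04] * 99 + [0.06]
print(benjamini_hochberg(pvals).sum())   # 99: all the p=0.04 results survive at FDR 0.05
```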


For a bayesian analysis, it's hard to treat "has an effect" as a hypothesis and you probably don't want to. You need to treat each possible effect size as a hypothesis, and have a probability density function rather than a probability for both prior and posterior.

You *could* go the other way if you want to mess around with Dirac delta functions, but you don't want that. Besides, the probability that any organic chemical will have literally zero effect on any biological process is essentially zero to begin with.


I want a like button for this comment.


And the effect size is what you actually want to know, if you are trying to balance the benefits of a drug against side effects or a of social programme against costs.


Has there already been any formal work done on generating probability density functions for statistical inferences, instead of just picking an overly specific estimate? This seems like you're on to something big.

Meteorology and economics, for example, return results like "The wind Friday at 4pm will be 9 knots out of the northeast" and "the CPI for October will be 1.8% annualized", which inevitably end up being wrong almost all of the time. If I'm planning a party outside Friday at 4pm I'd much rather see a graph showing the expected wind magnitude range than a single point estimate.


The concept you want is Conjugate Priors.

If the observations all follow a certain form, and the prior follows the conjugate form, the posterior will also follow the conjugate form, just with different parameters. Most conjugate forms can include an extremely uninformative version, such as a gaussian with near-infinite variance.

The Gaussian's conjugate prior is itself a Gaussian, which is very convenient.

The conjugate prior for Bernoulli data (a series of yes-or-nos) is called a Beta distribution (density proportional to x^a(1-x)^b) and is probably what we want here. Its higher-dimensional generalization is called a Dirichlet distribution.
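
A tiny sketch of the Beta-Bernoulli conjugate update in practice; the Beta(1, 1) prior and the yes/no outcomes are made up for illustration.

```python
from scipy import stats

a, b = 1, 1                        # fairly uninformative Beta(1, 1) prior on the rate

data = [1, 0, 1, 1, 1, 0, 1, 1]    # hypothetical yes/no outcomes

# conjugacy: each observation just increments one of the two parameters
a += sum(data)                     # number of 1s
b += len(data) - sum(data)         # number of 0s

posterior = stats.beta(a, b)
print(posterior.mean())            # posterior mean of the underlying rate
print(posterior.interval(0.95))    # 95% credible interval
```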


This might interest you:

https://arxiv.org/abs/2010.09209


Very cool, I'll dig in, thanks both


This is true, Scott's "Bayes factors" are completely the wrong ones. It's also true that the "real" Bayes factors are hard to calculate and depend on your prior of what the effect is, but just guessing a single possible effect size and calculating them will give you a pretty good idea — usually much better than dumb old hypothesis testing.

For example, let's assume that the effect size is 2 SD. This way any result 1 SD and up will give a Bayes factor in favor of the effect, anything 1 SD or below (or negative) will be in favor of H0. Translating the p values to SD results we get:

1. p = 0.049 ~ 2 SD

2. p = 0.008 ~ 2.65

3. p = 0.48 ~ 0

4. p = 0.052 ~ 2

For each one, we compare the height of the density function at the result's distance from 0 and from 2 (our guessed at effect), for example using NORMDIST(result - hypothesis,0,1,0) in Excel.

1. p = 0.049 ~ 2 SD = .4 density for H1 and .05 density for H0 = 8:1 Bayes factor in favor.

2. p = 0.008 ~ 2.65 = .32 (density at .65) and .012 (density at 2.65) = 27:1 Bayes factor.

3. p = 0.48 ~ 0 = 1:8 Bayes factor against

4. p = 0.052 ~ 2 = 8:1 Bayes factor in favor

So overall you get 8*8*27/8 = 216 Bayes factor in favor of ambidexterity being associated with authoritarian answers. This is skewed a bit high by a choice of effect size that hit exactly on the results of two of the questions, but it wouldn't be much different if we chose 1.5 or 2.5 or whatever. If we used some prior distribution of possible positive effect sizes from 0.1 to infinity maybe the factor would be 50:1 instead of 216:1, but still pretty convincing.
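
For anyone who wants to check the arithmetic, here is a sketch that reproduces those density ratios with SciPy. The assumed 2 SD effect size is the same illustrative choice as above; the exact product comes out near 200 rather than 216 because the 8s and 27 above are rounded.

```python
from scipy.stats import norm

effect = 2.0                          # assumed effect size under H1, in SD units
results_sd = [2.0, 2.65, 0.0, 2.0]    # the four results translated into SDs, as above

combined = 1.0
for z in results_sd:
    bf = norm.pdf(z - effect) / norm.pdf(z)   # density under H1 / density under H0
    combined *= bf
    print(f"z = {z:4.2f}  Bayes factor ~ {bf:5.2f}")

print("combined Bayes factor ~", round(combined))   # ~200 with exact densities
```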


I don't understand what you are doing here. Aren't you still double-counting evidence by treating separate experiments as independent? After you condition on the first test result, the odds ratio for the other ones has to go down because you are now presumably more confident that the 2SD hypothesis is true.


You're making a mistake here. p = 0.049 is nearly on the nose if the true effect size is 2 SD *and you have one sample*. But imagine you have a hundred samples, then p = 0.049 is actually strong evidence in favor of the null hypothesis. In general the standard deviation of an average falls like the square root of the sample size. So p = 0.049 is like 2/sqrt(100) = .2 standard deviations away from the mean, not 2.


Not directly answering the question at hand, but there's a good literature that examines how to design/build experiments. One paper by Banerjee et al. (yes, the recent Nobel winner) finds that you can rerandomize multiple times for balance considerations without much harm to the performance of your results: https://www.aeaweb.org/articles?id=10.1257/aer.20171634

So this experiment likely _could_ have been designed in a way that doesn't run into these balance problems


There is also a new paper by the Spielman group at Yale that examines how to balance covariates: https://arxiv.org/pdf/1911.03071.pdf


So this is fine in studies on healthy people where you can give all of them the drug on the same day. If you’re doing a trial on a disease that’s hard to diagnose then the recruitment is staggered and rerandomizing is impossible.


In the paper I linked above, Appendix A directly speaks to your concern on staggered recruitment. Long story short, sequential rerandomization is still close to first-best and should still be credible if you do it in a sensible manner.


On the ambidextrous analysis, I think what you want for a Bayesian approach is to say, before you look at the results of each of the 4 variables, how much you think a success or failure in one of them updates your prior on the others. e.g. I could imagine a world where you thought you were testing almost exactly the same thing and the outlier was a big problem (and the 3 that showed a good result really only counted as 1.epsilon good results), and I could also imagine a world in which you thought they were actually pretty different and there was therefore more meaning to getting three good results and less bad meaning to one non-result. Of course the way I've defined it, there's a clear incentive to say they are very different (it can only improve the likelihood of getting a significant outcome across all four), so you'd have to do something about that. But I think the key point is you should ideally have said something about how much meaning to read across the variables before you looked at all of them.


Oops I missed that Yair said a similar thing but in a better and more mathy way


"I don't think there's a formal statistical answer for this."

Matched pairs? Ordinarily only gender and age, but in principle you can do matched pairs on arbitrarily many characteristics. You will at some point have a hard time making matches, but if you can match age and gender in, say, 20 significant buckets, and then you have, say, five binary health characteristics you think might be significant, you would have about 640 groups. You'd probably need a study of thousands to feel sure you could at least approximately pair everyone off.


Hmm, wondering if you can do a retrospective randomized matched pairs subset based on random selections from the data (or have a computerized process to do a large number of such sub-sample matchings). Retrospectively construct matched pair groups on age, gender and blood pressure. Randomize the retrospective groupings to the extent possible. Re-run the analysis. Redo many times to explore the possible random subsets.


Isn't the point to put each member of a matched pair into the control or the intervention group? If so you'd need to do it first (i.e. in Scott's example before administering the vitamin D).


Not necessarily so. You could I think take an experiment that hadn't been constructed as a matched pair trial and then choose pairs that match on your important variables, one from treatment, one from control, as a sub-sample. If for any given pair you choose there are multiple candidates in the treatment and control group to select, you can "randomize" by deciding which two candidates constitute a pair. Now you've got a subsample of your original sample consisting of random matched pairs, each of which has one treatment and one control.


Post hoc you have a set of records, each of which has values for control (yes/no), age, blood pressure, etc., and outcome. You can then calculate a matching score on whichever parameters you like for each pair of intervention-control subjects. Then for each intervention subject you can rank the control subjects by descending match score and select the top 1 (or more if desired). This is matching with replacement, but could also be done without replacement.

There are many ways to achieve this after the fact is what I am saying. None are perfect but most are reasonable.
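
As a concrete illustration of "rank the controls by match score and take the top 1", here is a greedy 1:1 matching sketch without replacement; the covariates (age and systolic blood pressure) and the group sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical post-hoc data: covariate columns are (age, systolic BP)
treated_X = rng.normal([62, 140], [10, 15], size=(30, 2))
control_X = rng.normal([58, 128], [10, 15], size=(46, 2))

# standardise so age and blood pressure contribute comparably to the distance
mu = np.vstack([treated_X, control_X]).mean(axis=0)
sd = np.vstack([treated_X, control_X]).std(axis=0)
t_z, c_z = (treated_X - mu) / sd, (control_X - mu) / sd

# greedy matching: each treated subject takes the closest still-unmatched control
available = set(range(len(c_z)))
pairs = []
for i, row in enumerate(t_z):
    dists = {j: np.linalg.norm(row - c_z[j]) for j in available}
    j_best = min(dists, key=dists.get)
    pairs.append((i, j_best))
    available.remove(j_best)

print(pairs[:5])   # (treated index, matched control index)
```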


I'm not convinced there is actually an issue. Whenever we get a positive result in any scientific experiment there is always *some* chance that the result we get will be random chance rather than because of a real effect. All of this debate seems to be about analyzing a piece of that randomness and declaring it to be a unique problem.

If we do our randomization properly, on average, some number of experiments will produce false results, but we knew this already. It is not a new problem. That is why we need to be careful to never put too much weight in a single study. The possibility of these sorts of discrepancies is a piece of that issue, not a new issue. The epistemological safeguards we already have in place handle it without any extra procedure to specifically try to counter it.

Comment deleted

Matched pair studies seem fine, and I see the benefit of them, I am just not convinced they are always necessary, and they can't solve this problem.

Ultimately there are an arbitrary number of possible confounding variables, and no matter how much matching you do, there will always be some that "invalidate" your study. You don't even know which are truly relevant. If you did humanity would be done doing science.

If you were able to do a matched pair study that matched everything, not just age and gender, you would have to be comparing truly identical people. At that point, your study would be incredibly powerful, something fundamentally better than an RCT, but obviously this is impossible.


In any given study one starts with the supposition that some things have a chance of being relevant. For example, you may think some treatment, such as administering vitamin D, has a chance of preventing an illness or some of its symptoms. There are other things that are outside what you can control or affect that you also think may well be relevant, though you hope not too much. And finally there are an unlimited number of things that you think are very unlikely or that you have no reason to believe would be relevant but that you cannot rule out.

You seem to be suggesting everything in the second category should be treated as though it were in the third category, or else everything in the third category ought to be treated as though it belonged in the second category. But the world is not like that. It is not the case that the existence of things in the third category means that there are always more things that should have been in the second category. The two categories are not the same.


Although the third category is larger than the second category, the second category is also practically unlimited. Also, the exact location of the border between the two is subjective.

It seems like it would be impossible to explicitly correct for every possible element of the second category, but if you don't, it isn't clear that you are accomplishing very much.


It is not practically unlimited. At any given time there will typically be a small finite number of confounders of serious importance. Blood pressure is an obvious one. You are making a slippery slope argument against dealing with confounders, but there's no slope, much less a slippery one.


It seems to me to be a huge number. I am considering:

- Preexisting medical conditions

- Age, gender, race/ethnicity

- Other drugs/medical care (both for COVID and for preexisting conditions)

- Environmental factors

- Every possible permutation of the previous elements on the list

Which do you think aren't actually relevant?

I am not making a slippery slope argument. I'm not sure if you are misinterpreting my comment, or misusing the name of the argument, but either way, you are incorrect. If you clarify what you meant, I will explain in more detail.


The idea behind matched pair studies is that the pairing evens out all the known potentially confounding variables, while the randomization within the pair (coin toss between the two as to who goes into the experimental group and who into the control) should take care of the unknown ones.


It's not a new problem, but stratification exists to solve it. There are solutions to this issue.


Stratification is also a valuable tool, but I am not convinced it is necessary either, and using it inherently introduces p-hacking concerns and weakens the available evidence.


Both stratification and matching exist to deal with problems like this. Maybe a lot of times it's not necessary, but this post is entirely about a case where it might be: because of the high blood pressure confounder. I'm not sure what to make of the idea that the solution is worse than the problem: why do you think that?


I do not believe that the solution is worse than the problem, at least not in such an absolute sense.

What I actually believe is that the solution is not generally necessary. I also believe that in most situations where the experiment has already been done, the fact that one of these solutions had not been applied shouldn't have a significant impact on our credence about the study.


This is an awful lot of weight to put on your beliefs or opinions about the matter.


I don't understand what you are trying to say here. How much weight do you think I am putting on my beliefs and opinions?

author

There's a difference between saying "we know some of them will do this" and "we have pretty good evidence it was this one in particular"


Clarification question:

When you say "some of them", do you mean some of the studies, or some of the variables? I think you meant some of the studies, so I am going to respond to that. Please correct me if I am wrong.

I think that this is happening in (almost?) all of the studies. It is just a question of whether we happen to notice the particular set of variables that it is happening for. I think that the section of TFA about golden retrievers is a little bit misleading. Even considering only variables that could conceivably be relevant, there are still a nearly infinite number of possible variables. Whether it gets noticed for a particular study is arbitrary, and more related to which variables happened to get checked than to the strength of the study itself.


I would agree; when we say "there is a 5% chance of getting this result by random chance", then this is exactly the sort of scenario which is included in that five percent.

But what is currently doing my head in is this: once we know that there _is_ a significant difference between test and control groups in a potentially-significant variable, are we obliged to adjust our estimate that the observed effect might be due to random chance?

And if we are, then are we obliged to go out fishing for every other possible difference between our test and control groups?


I agree with you. I think there are decent arguments on both sides.

Arguments in favor of doing these corrections:

- Once we identify studies where there is a difference in a significant variable, that means that this particular study is more likely to be the result of chance

- Correcting can only improve our accuracy because we are removing questions where the randomization is causing interference.

Arguments against doing the corrections:

- There is always a significant variable that is randomized poorly (because of how many there are); when we notice it, that tells us more about what we can notice than it does about the study itself.

- A bunch of these sorts of errors cancel each other out. Removing some of them is liable to have unforeseen consequences.

- Ignoring the issue altogether doesn't do any worse on average. Trying to correct in only specific instances could cause problems if you aren't precise about it.

It seems complicated enough that I don't fully know which way is correct, but I am leaning against making the corrections.


If your original confidence level included the possibility that the study might have had some bias that you haven't checked for, then in principle, when you subsequently go fishing for biases, every bias you find should give you less faith in the study, BUT every bias that you DON'T find should give you MORE faith. You are either eliminating the possible worlds where that bias happened or eliminating the possible worlds where it didn't happen.

Of course, counting up all the biases that your scrutiny could've found (but didn't) is hard.


I don't think it quite works like that. The possible biases are equally likely to help you or hurt you. Before you actually examine them their expected impact is 0 (assuming you are running a proper RCT). Every bias that you find that helped the result resolve as it did would give you less credence in that result. Every bias that was in the opposite direction would give you more. The biases you don't know about should average out.


Wouldn't you a posteriori expect to find a positive impact from bias given that the result was positive?


I guess that depends on your priors on the effect size?


You’re right. This is a good point. It is only 0 if you are talking about all studies attempted.


You basically never want to be trying to base your analysis on combined P factors directly. You want to--as you said--combine together the underlying data sets and create a P factor on that. Or, alternately, treat each as a meta-data-point with some stdev of uncertainty and then find the (lower) stdev of their combined evidence. (Assuming they're all fully non-independent in what they're trying to test.)


Good thoughts on multiple comparisons. I made the same point (and a few more points) in a 2019 Twitter thread: https://twitter.com/stuartbuck1/status/1176635971514839041


A good way to handle the problem of failed randomizations is with re-weighting based on propensity scores (or other similar methods, but PSs are the most common). In brief, you use your confounders to predict the probability of having received the treatment, and re-weight the sample depending on the predicted probabilities of treatment. The end result of a properly re-balanced sample is that, whatever the confounding effect of blood pressure on COVID-19 outcomes, it confounds both treated and untreated groups with equal strength (in the same direction). Usually you see this method talked about in terms of large observational data sets, but it's equally applicable to anything (with the appropriate statistical and inferential caveats). Perfectly balanced data sets, like from a randomized complete block design, have constant propensity scores by construction, which is just another way of saying they're perfectly balanced across all measured confounders.
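
A bare-bones sketch of the re-weighting described above (inverse-probability-of-treatment weighting), on simulated data where blood pressure confounds a true treatment effect of 0.5; every number here is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# simulate confounding: higher blood pressure -> more likely treated, worse outcome
blood_pressure = rng.normal(130, 15, n)
p_true = 1 / (1 + np.exp(-(blood_pressure - 130) / 10))
treated = (rng.random(n) < p_true).astype(int)
outcome = 0.02 * blood_pressure + 0.5 * treated + rng.normal(0, 1, n)

# 1. model the probability of treatment from the confounder
ps_model = LogisticRegression().fit(blood_pressure.reshape(-1, 1), treated)
ps = ps_model.predict_proba(blood_pressure.reshape(-1, 1))[:, 1]

# 2. inverse-probability-of-treatment weights
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

# 3. weighted difference in mean outcomes estimates the treatment effect
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()
weighted = (np.average(outcome[treated == 1], weights=w[treated == 1])
            - np.average(outcome[treated == 0], weights=w[treated == 0]))
print("naive difference:   ", round(naive, 2))
print("weighted difference:", round(weighted, 2), "(should sit nearer the true 0.5)")
```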

For the p-value problem, whoever comes in and talks about doing it Bayesian I believe is correct. I like to think of significance testing as a sensitivity/specificity/positive predictive value problem. A patient comes in from a population with a certain prevalence of a disease (aka a prior), you apply a test with certain error statistics (sens/spec), and use Bayes' rule to compute the positive predictive value (assuming it comes back positive, NPV otherwise). If you were to do another test, you would use the old PPV in place of the original prevalence, and do Bayes again. Without updating your prior, doing a bunch of p-value based inferences is the same as applying a diagnostic test a bunch of different times without updating your believed probability that the person has the disease. This is clearly nonsense, and for me at least it helps to illustrate the error in the multiple hypothesis test setting.

Finally, seeing my name in my favorite blog has made my day. Thank you, Dr. Alexander.


> Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it?

The point of "randomizing" is to drown out factors that we don't know about. But given that we know that blood pressure is important, it's insane to throw away that information and not use it to divide the participant set.

I think the proper way to do this might be stratified sampling [1]. Divide the population into all relevant subgroups that you know about and then sample from each subgroup at the same rate to fill your two groups.

[1]: https://en.wikipedia.org/wiki/Stratified_sampling


A very simple test I'd like to see would be to re-run the analysis with all high-blood-pressure patients removed. (Maybe that's what they did when 'controlling for blood pressure' - or maybe they used some statistical methodology. The simple test would be hard to get wrong.)


If you did that you'd have only eleven patients left in the control group, which I'm gonna wildly guess would leave you with something statistically insignificant.


Well, that's the trouble with small tests. If the blood pressure confound is basically inseparable, and blood pressure is a likely Covid signifier, the result is just not all that strong for Vitamin D. One's priors won't get much of a kick.


[Should it be posteriors that get a kick? Seems like a pun that we'd have heard before...]


Adaptive randomisation does this but preserves the unpredictability of assignment from the perspective of the investigator: https://www.hilarispublisher.com/open-access/a-general-overview-of-adaptive-randomization-design-for-clinical-trials-2155-6180-1000294.pdf


>I think the proper way to do this might be stratified sampling [1]. Divide the population into all relevant subgroups that you know about and then sample from each subgroup at the same rate to fill your two groups.

I don't think that works for an n=76 study with fifteen identified confounders, or at least the simple version doesn't. You can't just select a bunch of patients from the "high blood pressure" group and then from the "low blood pressure" group, and then from the "over 60" group and the "under 60" group, and then the "male" group and the "female" group, etc, etc, because each test subject is a member of three of those groups simultaneously. By the time you get to e.g. the "men over 60 with low blood pressure" group, with n=76 each of your subgroups has an average of 9.5 members. With fifteen confounders to look at, you've got 32,768 subgroups, each of which has an average of 0.002 members.

If you can't afford to recruit at least 65,000 test subjects, you're going to need something more sophisticated than that.


This topic is actually well discussed among randomista econometricians. I believe they used to advise "rerolling" until you get all confounders to be balanced, but later thought it might create correlations or selection on *unobservable* confounders, so weakly advised against it.

I agree that stratification of some sort is what I would try.

For something more sophisticated, see these slides[1] by Chernozhukov which suggest something called post-double-selection.

The post-double-selection method is to select all covariates that predict either treatment assignment or the outcome by some measure of prediction (t-test, Lasso, ...). Then including those covariates in the final regression, and using the same confidence intervals.

[1] https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf


Regarding Bayes. If a test (that is independent of the other tests) gives a Bayes factor of 1:1, then that means that the test tells you nothing. Like, if you tested the Vitamin D thing by tossing a coin. It's no surprise that it doesn't change anything.


If a test for vit D and covid has lots of participants and the treatment group doesn't do better, that's not a 1:1 Bayes factor for any hypothesis where vit D helps significantly. The result is way less likely to happen in the world where vit D helps a lot.


People, correct me if I'm mistaken or missing the point. This is the image in my mind:

https://ibb.co/vcqW9HQ

The horizontal axis shows the results of a test. In black, the probability density in the null world (vit D doesn't help for covid). In blue, the probability density in some specific world (vit D helps with covid in *this* specific way).

The test result comes out as marked in red. The area to the right of that point is the p. It only depends on the black curve, so we don't need to get too specific about our hypothesis other than to know that the higher values are more likely in the hypothetical worlds.

The bayes factor would be the relative heights of the curves at the point of the result. Those depend on the specific hypotheses, and can clearly take values greater or smaller than 1 for "positive test results".


1. Did someone try to aggregate the survival expectation for both groups (patient by patient, then summed up) and control for this?

Because this is the one and main parameter.

2. Is the "previous blood pressure" strong enough a detail to explain the whole result?

3. My intuition is that this multiple comparison thing is way too dangerous an issue to go ex post and use one of the tests to explain the result.

This sounds counterintuitive. But this is exactly the garden of forking paths issue. Once you go after the fact to select numbers, your numbers are really meaningless.

Unless of course you happen to land on the 100% smoker example.

But!

You will need a really obvious situation, rather than a maybe parameter.


Rerolling the randomization, as suggested in the post, doesn't usually work because people are recruited one-by-one on a rolling basis.

But for confounders that are known a priori, one can use stratified randomization schemes, e.g. block randomization within each stratum (preferably categories, and preferably only few). There are also more fancy "dynamic" randomization schemes that minimize heterogeneity during the randomization process, but these are generally discouraged (e.g., EMA guideline on baseline covariates, Section 4.2).

In my naive understanding, spurious effects due to group imbalance are part of the game, that is, included in the alpha = 5% of false positive findings that one will obtain in the null hypothesis testing model (for practical purposes, it's actually only 2.5% because of two-sided testing).

But one can always run sensitivity analyses with a different set of covariates, and the authors seem to have done this anyway.


I think I've read papers where the population was grouped into similar pairs and each pair was randomized. It seems to me that the important question is not so much rolling recruitment as speed, in particular the time from recruitment and preliminary measurement to randomization. Acute treatments leave no time to pair people, but some trials have weeks from induction to treatment.


FYI, you can next-to-guarantee that the treatment and control groups will be balanced across all relevant factors by using blocking or by using pair-matching. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4318754/#:~:text=In%20randomized%20trials%2C%20pair%2Dmatching,best%20n%2F2%20matched%20pairs.


This is a nice trick with limited applicability in most clinical trial settings, because it requires that you know all of your subjects' relevant baseline characteristics simultaneously prior to randomizing in order to match them up. They could do that in their HIV therapy trial example because they would get "batches" of trial eligible subjects with preexisting HIV infection. In the COVID-Vit D study, and most others, subjects come one at a time, and expect treatment of some kind right away.


Nope, you could take the first person who arrives, say female between age 60 and 70, high blood pressure but no diabetes, and flip a coin to see which group she goes into. Treat her, measure what happens. Continue doing this for each new participant until you get a ‘repeat’; another female between 60 and 70, high blood pressure but no diabetes. She goes into whichever group that first woman didn’t go in.

Keep doing this until you’ve got a decent sized sample made up of these pairs. Discard the data from anyone who didn’t get paired up.


What you've described is something different than what the paper talks about though. Your solution is basically a dynamic allocation with equal weighting on HTN and diabetes as factors, and the complete randomization probability or second best probability set to 0% (Medidata Rave and probably other databases can do this pretty easily). And while it would definitely eliminate the chances of a between group imbalance on hypertension or diabetes, I still don't see it being a popular solution for two reasons. First, because the investigators know which group a subject is going to be in if they are second in the pair; second, because it's not clear ahead of time that you don't just want the larger sample size that you'd get if you weren't throwing out subjects that couldn't get matched up. It's sort of a catch-22: small trials like the Vitamin D study need all the subjects they can get, and can't afford to toss subjects for the sake of balance that probably evens out through randomization anyway; large trials can afford to do this, but don't need to, because things *will* even out by the CLT after a few thousand enrollments.


I think controlling for noise issues with regression is a fine solution for part 1. There are also ways of generating random groups subject to a constraint like "each group should have similar average Vitamin D." Pair up experimental units with similar observables, and randomly assign one of each pair to each group (like https://en.wikipedia.org/wiki/Propensity_score_matching but with an experimental intervention afterwards).

For question 2, isn't this what https://en.wikipedia.org/wiki/Meta-analysis is for? Given 4 confidence intervals of varying widths and locations, you either: 1. determine the measurements are likely to be capturing different effects, and can't really be combined; or 2. generate a narrower confidence interval that summarizes all the data. I think something like random-effects meta-analysis answers the question you are asking.
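
As a small illustration of the pooling idea, here is fixed-effect inverse-variance weighting (the simplest cousin of the random-effects model mentioned above) on made-up effect estimates and standard errors.

```python
import numpy as np

# hypothetical effect estimates (e.g. log odds ratios) and standard errors
# from four analyses of the same question
effects = np.array([-0.9, -1.2, -0.1, -0.8])
ses     = np.array([0.45, 0.50, 0.55, 0.40])

w = 1 / ses**2                                   # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))

print(f"pooled effect {pooled:.2f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.2f} to {pooled + 1.96 * pooled_se:.2f})")
```

A random-effects version (e.g. DerSimonian-Laird) would add an estimated between-study variance to each weight, widening the interval when the individual results disagree.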


Secondarily, 0 effect vs some effect is not a good Bayesian hypothesis. You should treat the effect as having some distribution, which is changed by each piece of information. The location and shape can be changed by any test result; an experiment with effect near 0 moves the probability mass towards 0, while an extreme result moves it away from 0.


You shouldn't just mindlessly adjust for multiple comparisons by dividing the significance threshold by the number of tests. This Bonferroni adjustment is used to "control the familywise error rate" (FWER), which is the probability of rejecting at least one hypothesis given that they are all true null hypotheses. Are you sure that is what you want to control in your ambidexterity analysis? It's not obvious that is what you want.


My former employer Medidata offers software-as-a-service (https://www.medidata.com/en/clinical-trial-products/clinical-data-management/rtsm) that lets you ensure that any variable you thought of in advance gets evenly distributed during randomization. The industry term is https://en.wikipedia.org/wiki/Stratification_(clinical_trials)


By the way: Thanks for mentioning the "digit ratio" among other scientifically equally relevant predictors such as amount of ice hockey played, number of nose hairs, eye color, percent who own Golden Retrievers.

Made my day <3


The easy explanation here is that the number of people randomized was so small that there was no hope of getting a meaningful difference. Remember, the likelihood of adverse outcome of COVID is well below 10% - so we're talking about 2-3 people in one group vs 4-5 in the other. In designing a trial of this sort, it's necessary to power it based on the number of expected events rather than the total number of participants.


Hmm yes, a few of the responses have suggested things like stratified randomisation and matched pairs but my immediate intuition is that n = ~75 is too small to do that with so many confounders anyway.


I think you would want to construct a latent construct out of your questions that measures 'authoritarianism', and then conduct a single test on that latent measure. Perhaps using a factor analysis or similar to try to divine the linear latent factors that exist, and perusing them manually (without looking at correlation to your response variable, just internal correlation) to see which one seems most authoritarianish. And then finally measuring the relationship of that latent construct to your response variable, in this case, ambidexterity.
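
A rough sketch of that approach on simulated survey data: pull one latent factor out of the four questions without looking at the response variable, then run a single test on the factor scores. The loadings, sample size, and ambidexterity rate are all invented.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 1000

# hypothetical survey: four questions all loading on one latent trait
latent = rng.normal(size=n)
answers = latent[:, None] * np.array([0.7, 0.6, 0.8, 0.5]) + rng.normal(0, 1, (n, 4))
ambidextrous = rng.random(n) < 0.12          # hypothetical predictor flag

# extract a single latent factor (no peeking at ambidexterity here)
score = FactorAnalysis(n_components=1, random_state=0).fit_transform(answers).ravel()

# one test on the composite instead of four separate tests
t, p = ttest_ind(score[ambidextrous], score[~ambidextrous])
print(f"t = {t:.2f}, p = {p:.3f}")
```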


I am now extremely confused (as distinct from my normal state of mildly confused), because I looked up the effects of Vitamin D on blood pressure.

According to a few articles and studies from cursory Googling, vitamin D supplementation will:

(1) It might reduce your blood pressure. Or it might not. It's complicated https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5356990/

(2) Won't have any effect on your blood pressure but may line your blood vessels with calcium like a furred kettle, so you shouldn't take it. Okay, it's kinda helpful for women against osteoporosis, but nothing doing for men https://health.clevelandclinic.org/high-blood-pressure-dont-take-vitamin-d-for-it-video/

(3) Have no effect https://www.ahajournals.org/doi/10.1161/01.hyp.0000182662.82666.37

This Chinese study has me flummoxed - are they saying "vitamin D has no effect on blood pressure UNLESS you are deficient, over 50, obese and have high blood pressure"?

https://journals.lww.com/md-journal/fulltext/2019/05100/the_effect_of_vitamin_d3_on_blood_pressure_in.11.aspx

"Oral vitamin D3 has no significant effect on blood pressure in people with vitamin D deficiency. It reduces systolic blood pressure in people with vitamin D deficiency that was older than 50 years old or obese. It reduces systolic blood pressure and diastolic pressure in people with both vitamin D deficiency and hypertension."


Maybe the Cordoba study was actually backing up the Chinese study, in that if you're older, fatter, have high blood pressure and are vitamin D deficient then taking vitamin D will help reduce your blood pressure. And reducing your blood pressure helps your chances with Covid-19.

So it's not "vitamin D against covid", it's "vitamin D against high blood pressure in certain segments of the population against covid" which I think is enough to confuse the nation.


You want to use 4 different questions from your survey to test a single hypothesis. I *think* the classical frequentist approach here would be to use Fisher's method, which tells you how to munge your p values into a single combined p: https://en.wikipedia.org/wiki/Fisher%27s_method

Fisher's method makes the fairly strong assumption that your 4 tests are independent. If this assumption is violated you may end up rejecting the null too often. A simpler approach that can avoid this assumption might be to Z-score each of your 4 survey questions and then sum the 4 Z-scores for each survey respondent. You can then just do a regular t-test comparing the mean sum-of-Z-scores between the two groups. This should have the desired effect (i.e. an increase in the power of your test by combining the info in all 4 questions) without invoking any hairy statistics.
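
Both combinations are available in SciPy. A quick sketch using the four p-values quoted earlier in the thread; as noted above, Fisher's method assumes the tests are independent and that the p-values all point in a consistent direction.

```python
from scipy.stats import combine_pvalues

pvals = [0.049, 0.008, 0.48, 0.052]   # the four ambidexterity questions

stat, p_fisher = combine_pvalues(pvals, method='fisher')
print("Fisher combined p:  ", round(p_fisher, 4))

# Stouffer's Z behaves like the "sum of Z-scores" idea described above
stat, p_stouffer = combine_pvalues(pvals, method='stouffer')
print("Stouffer combined p:", round(p_stouffer, 4))
```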


Btw, these folks are willing to bet $100k that Vitamin D significantly reduces ICU admissions. https://blog.rootclaim.com/treating-covid-19-with-vitamin-d-100000-challenge/


A few people seem to have picked up on some of the key issues here, but I'll reiterate.

1. The study should have randomized with constraints to match blood pressures between the groups. This is well established methodology.

2. Much of the key tension between the different examples is really about whether the tests are independent. Bayesianism, for example, is just a red herring here.

Consider trying the same intervention at 10 different hospitals, and all of them individually have an outcome of p=0.07 +/- 0.2 for the intervention to "work". In spite of several hospitals not meeting a significance threshold, that is very strong evidence that it does, in fact, work, and there are good statistical ways to handle this (e.g. regression over pooled data with a main effect and a hospital effect, or a multilevel model etc.). Tests that are highly correlated reinforce each other, and modeled correctly, that is what you see statistically. The analysis will give a credible interval or p-value or whatever you like that is much stronger than the p=0.05 results on the individual hospitals.

On the other hand, experiments that are independent do not reinforce each other. If you test 20 completely unrelated treatments, and one comes up p=0.05, you should be suspicious indeed. This is the setting of most multiple comparisons techniques.

Things are tougher in the intermediate case. In general, I like to try to use methods that directly model the correlations between treatments, but this isn't always trivial.
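
A toy version of the pooled-regression point: simulate ten hospitals sharing a common treatment effect but each with its own baseline, then fit one model with a main treatment effect and a hospital effect. All numbers are simulated for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

rows = []
for hospital in range(10):
    baseline = rng.normal(0, 0.5)            # hospital-specific baseline
    for _ in range(60):
        treated = int(rng.integers(0, 2))
        outcome = baseline + 0.3 * treated + rng.normal(0, 1)
        rows.append({"hospital": hospital, "treated": treated, "outcome": outcome})
df = pd.DataFrame(rows)

# one regression over the pooled data: treatment effect + hospital fixed effects
model = smf.ols("outcome ~ treated + C(hospital)", data=df).fit()
print(model.params["treated"], model.pvalues["treated"])
```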


One thing I'm wondering: is it possible to retroactively correct for the failure to stratify or match by randomly creating matched sub-samples, and resampling multiple times? Or does that introduce other problems.


That's not too far from what they did, by trying to control for BP. It's reasonable, but it's still much better to stratify your sampling unless your sample is so big that it doesn't matter.


There's a robust literature on post-sampling matching and weighting to control for failures in confounder balance. The simplest case is exact or 1-1 matching, where subjects in the treated and control sets with identical confounder vectors are "paired off," resulting in more balanced groups. A common tool here is propensity scores, which let you quantify how similar a treated versus control subject is with many covariates, or continuous covariates.

These kinds of techniques do change the populations you can make inferences about. What commonly happens is that you end up losing or downweighting untreated subjects (think large observational studies where most people are "untreated"), and so your inference is really only applicable to the treated population. But there are ways around it of course. If you're interested here's a good overview: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/


I don't see why the constraints are necessary. If you don't screw up the randomization (meaning the group partitions are actually equiprobable), the random variables describing any property of any member of one group are exactly the same as for the other group. Therefore, if you use correct statistical procedures (e.g., Fisher's exact test on the 2x2 table: given vitamin D/not given vitamin D and died/alive), your final p-value should already contain the probability that, for example, every high blood-pressure person got into one group. If, instead, you use constraints on some property during randomization, who knows what that property might be correlated with? That would imo weaken the result.


Here's why: if you force equally weighted grouping into strata (or matched pairs), then if the property is correlated with the outcome, you'll have a *better* or now worse understanding of the effect than if it's randomly selected for. If it's uncorrelated with the outcome, it does nothing. But if you ignore the property, and it's correlated, you may get a lopsided assignment on that characteristic.


Sorry, that should read: "better (or no worse) understanding"


JASSCC pretty much got this right. It generally hurts nothing to enforce balance in the randomization across a few important variables, and it reduces noise. That's why it's standard in so many trials. Consider a stylized example of a trial you want to balance by sex. In an unbalanced trial, you would just flip a coin for everyone independently, and run the risk of a big imbalance happening by chance. Or, you can select a random man for the control group, then one for the treatment. Then select a random woman for control, then one for treatment, etc. Everything is still randomized, but all the populations are balanced and you have removed a key source of noise.

Expand full comment

On the 'what could go wrong' point- what could go wrong is that in ensuring that your observable characteristics are nicely balanced, you've imported an assumption about their relation to characteristics that you cannot observe- so you're saying the subset of possible draws in which observable outcomes are balanced is also the subset where unobservable outcomes are balanced, which is way stronger than your traditional conditional independence assumption.

Expand full comment

I think I understand the problem with your approach to combining Bayes factors. You can only multiply them like that if they are conditionally independent (see https://en.wikipedia.org/wiki/Conditional_independence).

In this case, you're looking for P(E1 and E2|H), where E1, E2 are the results of your two experiments and H is your hypothesis.

Now, generally P(E1 and E2|H) != P(E1|H) * P(E2|H).

If you knew e.g. P(E1 | E2, H), you could calculate P(E1 and E2|H) = P(E1 | E2, H) * P(E2|H).

Expand full comment

I think that some of the confusion here is on the difference between the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). The first is the probability under the null hypothesis of getting at least one false positive. The second is the expected proportion of false positives over all positives. Assuming some level of independent noise, we should expect the FWER to increase with the number of tests if the power of our test is kept constant (this is not difficult to prove in a number of settings using standard concentration inequalities). The FDR, however, we can better control. Intuitively this is because one "bad" apple will not ruin the barrel as the expectation will be relatively insensitive to a single test as the number of tests gets large.

Contrary to what Scott claims in the post, Holm-Bonferroni does *not* require independence of tests because its proof is a simple union bound. I saw in another comment that someone mentioned the Benjamini-Hochberg rule as an alternative. This *does* (sometimes) require independent tests (more on this below) and bounds the FDR instead of the FWER. One could use the Benjamini-Yekutieli rule (http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_yekutieli_ANNSTAT2001.pdf) that again bounds the FDR but does *not* require independent tests. In this case, however, this is likely not powerful enough as it bounds the FDR in general, even in the presence of negatively correlated hypotheses.

To expand on the Benjamini-Hochberg test, we actually do not need independence and a condition in the paper I linked above suffices (actually, a weaker condition of Lehmann suffices). Thus we *can* apply Benjamini-Hochberg to Scott's example, assuming that we have this positive dependency. Thus, suppose that we have 99 tests with a p-value of .04 and one with a p-value of .06. Then applying Benjamini-Hochberg would tell us that we can reject the first 80 tests with a FDR bounded by .05. This seems to match Scott's (and my) intuition that trying to test the same hypothesis in multiple different ways should not hurt our ability to measure an effect.

Expand full comment

Sorry, I made a mistake with my math. I meant to say in the last paragraph that we would reject the first 99 tests under the BH adjustment and maintain a FDR of .05.
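A quick sketch of the Benjamini-Hochberg step-up rule, just to check that arithmetic: with 99 p-values of 0.04 and one of 0.06, it rejects the 99 smallest at a target FDR of 0.05.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean array of rejections under the BH step-up rule (FDR <= alpha)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()      # largest rank whose p-value passes its threshold
        reject[order[:k + 1]] = True        # reject everything up to and including that rank
    return reject

pvals = [0.04] * 99 + [0.06]
print(benjamini_hochberg(pvals).sum())      # 99
```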

Expand full comment

It's true that Bonferroni doesn't require independence. However, in the absence of independence it can be very conservative. Imagine if you ran 10 tests that were all almost perfectly correlated. You then use p < 0.005 as your corrected version of p < 0.05. You're still controlling the error rate – you will have no more than a 5% chance of a false positive. In fact, much less! But you will make many more type II errors, because the true p value threshold should be 0.05.

Expand full comment

Yes this is true. First of all, the Holm-Bonferroni adjustment is uniformly more powerful than the Bonferroni correction and has no assumption on the dependence structure of the tests, so Bonferroni alone is always too conservative. Second of all, this is kind of the point: there is no free lunch. If you want a test that can take advantage of a positive dependence structure and not lose too much power, after a certain point, you are going to have to pay for it by either reducing the generality of the test or weakening the criterion used to define failure. There are some options that choose the former (like Sidak's method) and others that choose the latter, by bounding FDR instead of FWER (still others do both, like the BH method discussed above).

Expand full comment

I agree with your post, and judging from where the p-values probably came from I agree they are probably positively dependent.

FYI there is a new version of the BH procedure which adjusts for known dependence structure. It doesn’t require positive dependence but is uniformly more powerful than BH under positive dependence:

https://arxiv.org/abs/2007.10438

Expand full comment

The answer to this is Regularized Regression with Poststratification: see here -- https://statmodeling.stat.columbia.edu/2018/05/19/regularized-prediction-poststratification-generalization-mister-p/

Expand full comment

I work in a big corp which makes money by showing ads. We run thousands of A/B tests yearly, and here are our standard ways of dealing with such problems:

1. If you suspect beforehand that there will be multiple significant hypotheses, run the experiment with two control groups, i.e. an AAB experiment. Then disregard all alternatives which aren't significant in both comparisons A1B and A2B.

2. If you run an A/B experiment and have multiple significant hypotheses, rerun the experiment and only pay attention to the hypotheses which were significant in the previous experiment.

I am not a statistician, so I'm unsure if this is formally correct.

Expand full comment
founding

I work in a small corp which makes money by showing ads. No, your methods are not formally correct. Specifically, for #1, you're just going halfway to an ordinary replication, and thus only "adjusting" the significance threshold by some factor which may-or-may-not be enough to account for the fact that you're doing multiple comparisons, and for #2, you're just running multiple comparisons twice, with the latter experiment having greater power for each hypothesis tested thanks to the previous adjusted-for-multiple-comparisons prior.

All that is to say- the math is complicated, and your two methods *strengthen* the conclusions, but they by no means ensure that you are reaching any particular significance level or likelihood ratio or whatever, since that would depend on how many hypotheses, how large the samples, and so on.

Expand full comment

Thanks for the clarification. Do you have dedicated people who make decisions on A/B tests? If not, I would be really interested in hearing your standard practices for A/B testing.

Expand full comment
founding

Generally, the decisions are "pre-registered"- we've defined the conditions under which the test will succeed or fail, and we know what we'll do in response to either state, so no one is really making a decision during/after the test. All of the decisions come during test design- we do have a data scientist on staff who helps on larger or more complicated designs, but generally we're running quite simple A/B tests: Single defined goal metric, some form of randomization between control and test (which can get slightly complicated for reasons pointed out in the OP!), power calculation done in advance to ensure the test is worth running, and an evaluation after some pre-defined period.

Usually we do replications only when A) requested, or B) enough time has passed that we suspect the underlying conditions might change the outcome, or C) the test period showed highly unusual behavior compared to "ordinary" periods.

We've been considering moving to a more overtly Bayesian approach to interpreting test results and deciding upon action thresholds (as opposed to significance-testing), but haven't done so yet.

Expand full comment

Can I ask why the big corp doesn't employ a statistician/similar to do this? I'm not trying to be snarky, but if they're investing heavily in these experiments it seems weird to take such a loose approach.

Expand full comment

It does employ several analysts who, as far as I know, have at least a passable knowledge of statistics. As for why our A/B testing isn't formally correct, I have several hypotheses:

1. A lot of redundancy - we have literally thousands of metrics in each A/B test, so any reasonable hypothesis would involve at least 3-5 metrics - therefore we're unlikely to get a false positive. If, on the other hand, an A/B test shows non-significant results for something we believe in, we will often redo the experiment, which decreases the chance of a false negative. So from a Bayesian point of view we are kind of alright - we just inject some of our beliefs into the A/B tests.

2. It's really hard to detect statistical problems in human decisions (all A/B tests have human-written verdicts) because the majority of human decisions have multiple justifications. Furthermore, it's even harder to calculate the damage from subtly wrong decision-making - would we have redone the experiment if the results had been a bit different? Did it cost us anything if we mistakenly approved design A, which isn't really different from design B? A lot of legibility problems here.

3. Deployment of analysts is very skewed by department - I know that some departments have 2-3x more analysts than we do. Maybe the culture of experimentation is better in such departments.

Expand full comment

>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant. But I can construct another common-sensically significant version that it wouldn't find significant - in fact, I think all you need to do is have ninety-nine experiments come back p = 0.04 and one come back 0.06.

About the Holm-Bonferroni method:

How it works, is that you order the p-values from smallest to largest, and then compute a threshold for significance for each position in the ranking. The threshold formula is: α / (number of tests – rank + 1), where α is typically 0.05.

Then the p-values are compared to the threshold, in order. If the p-value is less than the threshold the null hypothesis is rejected. As soon as one is above the threshold, that one, and all subsequent p-values in the list, fail to reject the null hypothesis.

So for your example of 100 tests where one is 0.06 and others are all 0.04, it would come out to:

Rank p Threshold

1 0.04 0.05 / 100 = 0.0005

2 0.04 0.05 / 99 = 0.00051

....

100 0.06 0.05 / 1 = 0.05

So you're right, none of those would be considered "significant". But you'd have to be in some pretty weird circumstances to have almost all your p-values be 0.04.
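For reference, the step-down procedure described above can be written in a few lines (a sketch, not a validated implementation); on the 99-times-0.04-plus-one-0.06 example it rejects nothing, matching the hand calculation:

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Indices rejected by the Holm-Bonferroni step-down procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])    # smallest p-value first
    rejected = set()
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):              # alpha / (m - rank + 1) with 1-based ranks
            rejected.add(i)
        else:
            break                                       # stop at the first p-value over its threshold
    return rejected

print(len(holm_bonferroni([0.04] * 99 + [0.06])))       # 0 -- nothing is rejected
```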

Expand full comment

The big concept here, is that this protocol controls the *familywise error rate*, which is the probability that at least one of your rejections of the null is incorrect. So it makes perfect sense that as you perform more tests, each test has to pass a stricter threshold.

What you are actually looking for is the *false discovery rate* which is the expected fraction of your rejections of the null that are incorrect. There are other formulas to calculate this. https://en.wikipedia.org/wiki/False_discovery_rate

Expand full comment

Yes, exactly. I made that point above with respect to what factors go into the decision on how to adjust P-values for multiple comparisons.

Expand full comment

I also forgot to mention, that these multiple hypothesis correction formulas assume that the experiments are **independent**. This assumption would be violated in your thought experiment of measuring the same thing in 100 different ways.

Expand full comment

They actually just assume positive dependence, not independence. For example, if the family of tests is jointly Gaussian and the tests are all nonnegatively correlated, then Theorem 1.2 and Section 3.1, case 1 of http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_yekutieli_ANNSTAT2001.pdf imply that we can still use Benjamini-Hochberg and bound the FDR. In particular, this should hold if the tests are all measuring the same thing, as in Scott's example.

Expand full comment

Cool, I didn't know that it still held under positive dependence

Expand full comment

"by analogy, suppose you were studying whether exercise prevented lung cancer. You tried very hard to randomize your two groups, but it turned out by freak coincidence the "exercise" group was 100% nonsmokers, and the "no exercise" group was 100% smokers."

But don't the traditionalists say that this is a feature, not a bug, of randomization? That if unlikely patterns appear through random distribution, this is merely mirroring the potential for such seemingly nonrandom grouping in real life? I mean this is obviously a very extreme example for argumentative purposes, but I've heard people who are really informed about statistics (unlike me) say that when you get unexpected patterns from genuine randomization, hey, that's randomization.

Expand full comment

Isn’t that the idea of Poisson clumping?

Expand full comment

About p thresholds, you've pretty much nailed it by saying that simple division works only if the tests are independent. And that is pretty much the same reason why the 1:1 likelihood ratio can't be simply multiplied by the others and give the posterior odds. This works only if the evidence you get from the different questions is independent (see Jaynes's PT:LoS chap. 4 for reference)

Expand full comment

re: Should they reroll their randomization

What if there was a standard that after you randomize, you try to predict as well as possible which group is more likely to naturally perform better, and then you make *that* the treatment group? Still flawed, but feels like a way to avoid multiple rolls while also minimizing the chance of a false positive (of course assuming avoiding false positives is more important than avoiding false negatives).

Expand full comment
founding

Did you get this backwards? It seems like if you're trying to minimize the false positive, you'd make the *control group* the one with a better "natural" performance result. That being said, in most of these sorts of trials, it's unclear of the helpfulness of many of the potential confounders.

Expand full comment

Yep... Thanks for pointing that out, flip what I said.

Re:unclear: I guess I'm picturing that if people are gonna complain after the fact, you'd hope it would have been clear before the fact? If people are just trying to complain then that doesn't work, but if we assume good faith it seems plausible? Maybe you could make it part of preregistration (the kind where you get peer reviewed before you run the study), where you describe the algorithm you'll use to pick which group should be the control group, and the reviewers can decide if they agree with the weightings. Once you get into territory where it isn't clear, I'd hope the after-the-fact-complaints should be fairly mild?

Expand full comment

The Bonferroni correction has always bugged me philosophically for reasons vaguely similar to all this. Merely reslicing and dicing the data ought not, I don’t think, ruin the chances for any one result to be significant. But then again I believe what the p-hacking people all complain about, so maybe we should just agree that 1/20 chance is too common to be treated as unimpeachable scientific truth!

Expand full comment

Hm. I think to avoid the first problem, divide your sample into four groups, and do the experiment twice, two groups at a time. If you check for 100 confounders, you get an average of 5 in each group, but an average 0.25 confounders in both, so with any luck you can get the statistics to tell you whether any of the confounders made a difference (although if you didn't increase N you might not have a large enough experiment in each half).

Expand full comment

When Scott first mentioned the Cordoba study (https://astralcodexten.substack.com/p/covidvitamin-d-much-more-than-you) I commented (https://astralcodexten.substack.com/p/covidvitamin-d-much-more-than-you#comment-1279684) that it seemed suspect because some of the authors of that study, Gomez and Boullion, were also involved in another Spanish Vitamin D study several months later that had major randomization issues (see https://twitter.com/fperrywilson/status/1360944814271979523?s=20 for explanation). Now, The Lancet has removed the later study and is investigating it: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3771318.

Expand full comment

The Bayesian analysis only really works if you have multiple *independent* tests of the same hypothesis. For example, if you ran your four tests using different survey populations it might be reasonable. However, since you used the same survey results you should expect the results to be correlated and thus multiplying the confidence levels is invalid.

As an extreme version of this, suppose that you just ran the *same* test 10 times, and got a 2:1 confidence level. You cannot conclude that this is actually 1000:1 just by repeating the same numbers 10 times.

Expand full comment

Combining data from multiple experiments testing the same hypothesis to generate a more powerful test is a standard meta-analysis. Assuming the experiments are sufficiently similar, you effectively paste all the data sets together and then calculate the p value for the combined data set. With a bit of maths you can do this using only the effect sizes (d) and standard errors (s) from each experiment (you create 1/s^2 copies of d for each experiment and then run the p value). The reason none of the other commenters have suggested this as a solution to your ambidexterity problem (being ambidextrous isn't a problem! Ahem, anyway) is that you haven't given enough data - just the p values instead of the effect sizes and standard errors.

I tried to get this from the link you give on the ambidexterity post to the survey, but it gave me separate pretty graphs instead of linked data I could do the analysis on. However, I can still help you by making an assumption: Assuming that the standard error across all 4 questions is the same (s), since they come from the same questionnaire and therefore likely had similar numbers of responders, we can convert the p values into effect sizes - the differences (d) - using the inverse normal function:

1. p = 0.049 => d = 1.65s

2. p = 0.008 => d = 2.41s

3. p = 0.48 => d = 0.05s

4. p = 0.052 => d = 1.63s

We then average these to get the combined effect size d_c = 1.43s. However all the extra data has reduced the standard error of our combined data set. We're assuming all experiments had the same error, so this is just like taking an average of 4 data points from the same distribution - i.e. the standard error is divided by sqrt(n). In this case we have 4 experiments, so the combined error (s_c) is half what we started with. Our combined measure is therefore 1.43s/0.5s = 2.86 standard deviations above our mean => combined p value of 0.002

Now this whole analysis depends on us asserting that these 4 experiments are basically repeats of the same experiment. Under that assumption, you should be very sure of your effect!

If the different experiments had different errors we would create a weighted average with the lower error experiments getting higher weightings (1/s^2 - the inverse of the squared standard error as this gets us back to the number of data points that went into that experiement that we're pasting into our 'giant meta data set'), and similarly create a weighted average standard error (sqrt(1/sum(1/s^2))).
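The equal-error calculation above is essentially Stouffer's Z-score method with equal weights; here is a short scipy version of the same arithmetic (with the caveat, raised below, that it treats the four tests as independent):

```python
from scipy.stats import norm

p_values = [0.049, 0.008, 0.48, 0.052]

# Convert each one-sided p-value back into an effect size in units of the
# assumed common standard error s: d_i = Phi^{-1}(1 - p_i) * s.
z_scores = [norm.isf(p) for p in p_values]       # roughly [1.65, 2.41, 0.05, 1.63]

# Averaging n equal-variance estimates divides the standard error by sqrt(n),
# so the combined z is mean(z) * sqrt(n).
n = len(z_scores)
combined_z = sum(z_scores) / n * n ** 0.5
combined_p = norm.sf(combined_z)
print(combined_z, combined_p)                    # about 2.87 and 0.002, as in the text above
```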

Expand full comment

This should only be valid if the four tests are independent of each other. Since they are using the same sample, this analysis is probably invalid.

Expand full comment

Blast. You're quite right. This problem is called pseudoreplication. My limited understanding is that our options are:

1. Average the answers people gave to each of the 4 questions to create one, hopefully less noisy score of 'Authoritarianess' for each real replicate (independent person answering the questionnaire), and then run the standard significance test on that.

2. Develop a hierarchical model which describes the form of the dependence between the different answers, and then perform a marginally more powerful analysis on this.

Expand full comment

Yes. But Scott should be able to estimate the covariance: I am guessing these 4 p-values are all coming from two sample t-tests on four correlated responses. The correlations between the differences between the four means should be* the same as the individual-level within-sample correlations of the four responses, which Scott can calculate with access to the individual-level data.

* depending on modeling assumptions

Expand full comment

Interesting....

I mean the right thing to do is probably to combine all four tests into one big test. Perhaps the most general thing you could do is to treat the vector of empirical means as a sample from a 4-d Gaussian (that you could approximate the covariance of), and run some kind of multi-dimensional T-test on.

Do you know what the right test statistic would be though? If your test statistic is something like the sum of the means I think your test would end up being basically equivalent to Neil's option 1. If your test statistic was the maximum of the component means, I guess you could produce the appropriate test, but it'd be complicated. Assuming that you had the covariance matrix correct (doing an analogue of a z-test rather than a t-test), you'd have to perform a complicated integral in order to determine the probability that a Gaussian with that covariance matrix produces a coordinate whose mean is X larger than the true mean. If you wanted to do the correct t-test analogue, I'm not even sure how to do it.

Expand full comment

Monte Carlo should be a good solution for either thing you want to do. Nonparametric bootstrap is also probably a bit better than the Gaussian approximation.

I don’t think it will be roughly equivalent to treating them as independent. The average of 4 correlated sample means may have a much larger variance than the average of 4 independent ones.

Expand full comment

Not equivalent to treating them as independent. Equivalent to Neil's revision where he talks about averaging scores.

I don't think Monte Carlo works for our t-test analogue (though it works to perform the integral for the z-test). The problem is that we want to know that *whatever* the ground truth correlation matrix, we don't reject with probability more than 0.05 if the true mean is 0. Now for any given correlation matrix, we could test this empirically with Monte Carlo, but finding the worst case correlation matrix might be tricky.

Expand full comment

Ah, I might not be following all the different proposals, sorry if I missed something.

Unfortunately with global mean testing there isn’t a single best choice of test stat, but in this problem “average of the 4 se-standardized mean shifts” seems as reasonable a combined test stat as anything else.

Regarding “what is the worst case correlation matrix” I was assuming Scott would just estimate the correlation matrix from the individual level data and plug it in as an asymptotically consistent estimate. Whether or not that’s reasonable depends on the sample size, which I don’t know.

If the sample size isn’t “plenty big” then some version of the bootstrap (say, iid but stratified by ambi and non-ambi) should also work pretty well to estimate the standard error of whatever test statistic he wants to use. Have to set it up the right way, but that is a bit technical for a reply thread.

Permutation test will also work in small samples as a last resort but would not give specific inference about mean shifts.
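If it's useful, here is a rough sketch of that kind of stratified (within-group) bootstrap for the "average of the se-standardized mean shifts" statistic. The arrays `ambi` and `non_ambi` are hypothetical stand-ins for the individual-level data; resampling within each group preserves the correlation between the four responses.

```python
import numpy as np

def combined_stat(ambi, non_ambi):
    """Average over the 4 responses of (difference in means / standard error of that difference)."""
    diff = ambi.mean(axis=0) - non_ambi.mean(axis=0)
    se = np.sqrt(ambi.var(axis=0, ddof=1) / len(ambi) +
                 non_ambi.var(axis=0, ddof=1) / len(non_ambi))
    return (diff / se).mean()

def bootstrap_se(ambi, non_ambi, n_boot=2000, seed=0):
    """Standard error of the combined statistic from an iid bootstrap within each group."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        a = ambi[rng.integers(0, len(ambi), len(ambi))]
        b = non_ambi[rng.integers(0, len(non_ambi), len(non_ambi))]
        stats.append(combined_stat(a, b))
    return np.std(stats, ddof=1)

# Hypothetical data: 100 ambidextrous and 2000 other respondents, 4 correlated items.
rng = np.random.default_rng(1)
cov = np.full((4, 4), 0.5) + 0.5 * np.eye(4)
ambi = rng.multivariate_normal(np.zeros(4), cov, size=100)
non_ambi = rng.multivariate_normal(np.zeros(4), cov, size=2000)

obs = combined_stat(ambi, non_ambi)
se = bootstrap_se(ambi, non_ambi)
print(obs, se)   # compare obs / se to a standard normal for an approximate global test
```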

Expand full comment

I wouldn't worry too much about the vitamin D study.

For one thing, it is perfectly statistically correct to run the study without doing all of these extra tests, and as you point out, throwing out the study if any of these tests comes back with p-value less than 0.05 would basically mean the study can never be done.

Part of the point of randomizing is that you would expect any confounding factors to average out. And sure, you got unlucky and people with high blood pressure ended up unevenly distributed. On the other hand, Previous lung disease and Previous cardiovascular disease ended up pretty unbalanced in the opposite direction (p very close to 1). If you run these numbers over many different possible confounders, you'd expect the effects to roughly average out.

I feel like this kind of analysis is only really useful to

A) Make sure that your randomization isn't broken somehow.

B) If the study is later proved to be wrong, it helps you investigate why it might have been wrong.

Expand full comment

"I think the problem is that these corrections are for independent hypotheses, and I'm talking about testing the same hypothesis multiple ways (where each way adds some noise). "

If you are testing the same hypothesis multiple ways, then your four variables should be correlated. In this case you can extract one composite variable from the four with a factor analysis, et voilà: just one test to perform!

Expand full comment

"Maybe I need a real hypothesis, like "there will be a difference of 5%", and then compare how that vs. the null does on each test? But now we're getting a lot more complicated than just the "call your NHST result a Bayes factor, it'll be fine!" I was promised."

Ideally what you need is a prior distribution over possible hypotheses. So, for example, you might think before any analysis of the data that there is a 90% chance of no effect, and that if there is an effect you expect its size to be normally distributed about 0 with an SD of 1. Then your prior distribution on the effect size x is f(x) = 0.9*delta(0) + 0.1*N(0,1), and if p(x) is the probability of your observation given an effect size of x, you can calculate the posterior distribution of the effect size as f(x)*p(x) / int(f(x)*p(x) dx).
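A crude grid-approximation sketch of that spike-and-slab calculation; the observed effect estimate and its standard error here are made-up numbers, just to show the mechanics:

```python
import numpy as np
from scipy.stats import norm

# Prior: 90% point mass at zero ("no effect") plus a 10% N(0, 1) slab over effect sizes.
prior_null = 0.9
slab = norm(0, 1)

# Hypothetical data summary: observed effect estimate 0.4 with standard error 0.2,
# so the likelihood of the data at effect size x is approximately N(obs; x, se).
obs, se = 0.4, 0.2
likelihood = lambda x: norm(x, se).pdf(obs)

# Posterior probability of "no effect", integrating over the slab on a grid.
grid = np.linspace(-5, 5, 2001)
dx = grid[1] - grid[0]
marginal_slab = np.sum(likelihood(grid) * slab.pdf(grid)) * dx
numerator_null = prior_null * likelihood(0.0)
posterior_null = numerator_null / (numerator_null + (1 - prior_null) * marginal_slab)
print(posterior_null)   # posterior probability that the effect is exactly zero
```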

Expand full comment

If the express goal of the "randomization" is the resulting groups being almost equal on every conceivable axis and any deviation from equal easily invalidates any conclusion you might want to draw... is randomization maybe the worst tool to use here?

Would a constraint solving algorithm not fare much, much better, getting fed the data and then dividing them 10'000 different ways to then randomize within the few that are closest to equal on every axis?

I hear your cries: malpractice, (un-)conscious tampering, incomplete data, bugs, ... But a selection process that reliably kills a significant portion of all research despite best efforts is hugely wasteful.

How large is that portion? Cut each bell curve (one per axis) down to the acceptable peak in the middle (= equal enough along that axis) and take that to the power of the number of axes. That's a lot of avoidable, potentially useless studies!

Expand full comment

Clinical trial patients arrive sequentially or in small batches; it's not practical to gather up all of the participants and then use an algorithm to balance them all at once.

Expand full comment

https://maxkasy.github.io/home/files/papers/experimentaldesign.pdf

This paper discusses more or less this issue with potentially imbalanced randomization from a Bayesian decision theory perspective. The key point is that, for a researcher who wants to minimize expected loss (as in, you get a payoff of 0 if you draw the right conclusion, and -1 if you draw the wrong one), there is, in general, one optimal allocation of units to treatment and control. Randomization only guarantees groups are identical in expected value ex-ante, not ex-post.

You don't want to do your randomization and find that you are in the 1% state of world where there is wild imbalance. Like you said, if you keep re-doing your randomization until everything looks balanced, that's not quite random after all. This paper says you should bite the bullet and just find the best possible treatment assignment based on characteristics you can observe and some prior about how they relate to the outcome you care about. Once you have the best possible balance between two groups, you can flip a coin to decide what is the control and what is the treatment. What matters is not that the assignment was random per se, but that it is unrelated to potential outcomes.

I don't know a lot about the details of medical RCTs, but in economics it's quite common to do stratified randomization, where you separate units in groups with similar characteristics and randomize within the strata. This is essentially taking that to the logical conclusion.

Expand full comment

1. P-values should not be presented for baseline variables in an RCT. It is just plain illogical. We are randomly drawing two groups from the same population. How could there be a systematic difference? In the words of Doug Altman: ”Performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance.”

In this specific study, however, I'm not all that sure that the randomization really was that random. The d-vit crowd in Spain seems to be up to some shady stuff: https://retractionwatch.com/2021/02/19/widely-shared-vitamin-d-covid-19-preprint-removed-from-lancet-server/. My 5c: https://twitter.com/jaralaus/status/1303666136261832707?s=21

Regardless, it is wrong to use p-values to choose what variables to use in the model. It is really very straightforward: just decide a priori what variables are known (or highly likely) predictors and put them in the model. Stephen Senn: ”Identify useful prognostic covariates before unblinding the data. Say you will adjust for them in the statistical analysis plan. Then do so.” (https://www.appliedclinicaltrialsonline.com/view/well-adjusted-statistician-analysis-covariance-explained)

2. As others have pointed out, Bayes factors will quantify the evidence provided by the data for two competing hypothesis. Commonly a point nil null hypothesis and a directional alternative hypothesis (”the effect is larger/smaller than 0”). A ”negative” result would not be 1:1, that is a formidably inconclusive result. Negative would be eg 1:10 vs positive 10:1.

Expand full comment

1. Exactly. The tests reported in Scott's first table above are of hypotheses that there are differences between *population* treatment groups. They only make sense if we randomized the entire population and then sampled from these groups for our study. Which we don't. I typically report standardized differences for each factor, and leave it at that.

Expand full comment

Regarding part 2, when your tests are positively correlated, there are clever tricks you can do with permutation tests. You resample the data, randomly permute your dependent variable, run your tests, and collect the p values. If you do this many times, you get a distribution of the p values under the null hypothesis. You can then compare your actual p values to this. Typically, for multiple tests, you use the maximum of a set of statistics. The reference is Westfall, P.H. and Young, S.S., 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment (Vol. 279). John Wiley & Sons.
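A bare-bones sketch of the single-step minP idea (ignoring the refinements in Westfall and Young's book), using ordinary two-sample t-tests as the per-outcome test and made-up simulated data; the permutation distribution of the minimum p-value supplies the adjustment:

```python
import numpy as np
from scipy.stats import ttest_ind

def westfall_young_minp(y, X, n_perm=2000, seed=0):
    """Single-step minP adjustment: y is a 0/1 group label, X an (n, k) matrix of outcomes."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    raw = np.array([ttest_ind(X[y == 1, j], X[y == 0, j]).pvalue for j in range(k)])
    min_p_null = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)     # permuting labels breaks any real association
        min_p_null[b] = min(ttest_ind(X[y_perm == 1, j], X[y_perm == 0, j]).pvalue
                            for j in range(k))
    adjusted = np.array([(min_p_null <= p).mean() for p in raw])
    return raw, adjusted

# Hypothetical example: 4 positively correlated outcomes, 300 subjects, no true effect.
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(4), np.full((4, 4), 0.6) + 0.4 * np.eye(4), size=300)
y = rng.integers(0, 2, size=300)
print(westfall_young_minp(y, X))
```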

Expand full comment

This is the kind of approach I would take for tests that I expect to be correlated. Or possibly attempt to model the correlation structure among the endpoints, if I wanted to do something Bayesian.

Expand full comment

Agreed with Westfall&Young minP approach.

Expand full comment

You can't run a z-test and treat "fail to reject the null" as the same as "both are equivalent". The problem with that study is that they didn't have nearly enough power to say that the groups were "not significantly different", and relying on a statistical cutoff for each of those comparisons instead of thinking through the actual problems makes this paper useless, in my view.

Worst example: If you have 8 total patients with diabetes (5 in the smaller group and 3 in the larger) you are saying that a 3.5 fold difference in diabetes incidence rate (accounting for the group sizes) is not significant. Obviously that is the kind of thing that can only happen if you are relying on p-values to do your thinking for you, as no reasonable person would consider those groups equivalent. It's extra problematic because, coincidentally, all of those errors happened to point in the same direction (more comorbidities in the control group). This is more of the "shit happens" category of problem though, rather than a flaw in the study design.

There are lots of ways they could have accounted for differences in the study populations, which I expect they tried and it erased the effect. Regressing on a single variable (blood pressure) doesn't count... you would need to include all of them in your regression. The only reason this passed review IMO is because it was "randomized" and ideally that should take care of this kind of issue. But for this study (and many small studies) randomization won't be enough.

I put this paper in the same category of papers that created the replication crisis in social psychology: "following all the rules" and pretending that any effect you find is meaningful as long as it crosses (or in this case, doesn't cross) the magical p=0.05 threshold. The best response to such a paper should be something like "well that is almost certainly nothing, but I'd be slightly more interested in seeing a follow-up study"

Expand full comment

You need to specify in advance which characteristics you care about ensuring proper randomization and then do block randomization on those characteristics.

There are limits to the number of characteristics you can do this for, with a given sample size / effect size power calculation.

This logic actually says that you *can* re-roll the randomization if you get a bad one --- in fact, it says you *must* do this, that certain "randomizations" are much better than other ones because they ensure balance on characteristics that you're sure you want balance on.

Expand full comment
founding

>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant.

Unfortunately, the Holm-Bonferroni method doesn't work that way. It always requires that one of the p-values at least be significant in the multiple comparisons sense; so at least one p-value less than 0.05/n, where n is the number of comparisons.

Its strength is that it doesn't require all the p-values to be that low. So if you have five p-values: 0.008, 0.012, 0.015, 0.02, 0.04, then all five are significant given Holm-Bonferroni, whereas only the smallest (0.008 < 0.05/5) would be significant for the standard multiple comparisons test.

Expand full comment

I don't claim to have any serious statistical knowledge here, but my intuitive answer is that expected evidence should be conserved.

If you believe that vitamin D reduces COVID deaths, you should expect to see a reduction in deaths in the overall group. It should be statistically significant, but that's effectively a way of saying that you should be pretty sure you're really seeing it.

If you expect there to be an overall difference, then either you expect that you should see it in most ways you could slice up the data, or you expect that the effect will be not-clearly-present in some groups but very-clearly-present in others, so that there's a clear effect overall. I think the latter case means _something_ like "some subgroups will not be significant at p < .05, but other subgroups will be significant at p < (.05 / number of subgroups)". If you pick subgroups _after_ seeing the data, your statistical analysis no longer reflects the expectation you had before doing the test.

For questions like "does ambidexterity reduce authoritarianism", you're not picking a single metric and dividing it across groups - you're picking different ways to operationalize a vague hypothesis and looking at each of them on the same group. But I think that the logic here is basically the same: if your hypothesis is about an effect on "authoritarianism", and you think that all the things you're measuring stem from or are aspects of "authoritarianism", you should either expect that you'll see an effect on each one (e.g. p = .04 on each of four measures), or that one of them will show a strong enough effect that you'll still be right about the overall impact (e.g. p = .01 on one of four measures).

For people who are giving more mathematical answers: does this intuitive description match the logic of the statistical techniques?

Expand full comment

"Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it? But it seems harsh to force them to perform a study even though we know we'll dismiss the results as soon as we get them. If we made them check a pre-written list of confounders and re-roll until there were no significant differences on any of them, what could go wrong? I don't have a good answer to this question, but thinking about it still creeps me out."

The orthodox solution here is stratified random sampling, and it is fairly similar. For example, you might have a list of 2,500 men and 2,500 women (assume no NBs). You want to sample 500, 250 for control and 250 for intervention, and you expect that gender might be a confound. In stratified random sampling, instead of sampling 250 of each and shrugging if there's a gender difference, you choose 125 men for the control and 125 men for the intervention, and do the same for women. This way you are certain to get a balanced sample (https://www.investopedia.com/terms/stratified_random_sampling.asp, for example). While this only works with categorical data, you can always just bin continuous data until it cooperates.

The procedure is statistically sound, well-validated, and commonly practiced.
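A sketch with the numbers from that example (2,500 men and 2,500 women, 125 of each per arm); the participant IDs are invented:

```python
import random

rng = random.Random(0)
# Hypothetical sampling frame: 2,500 men and 2,500 women.
population = [{"id": i, "sex": "M" if i < 2500 else "F"} for i in range(5000)]

assignments = {}
for sex in ("M", "F"):
    stratum = [p for p in population if p["sex"] == sex]
    chosen = rng.sample(stratum, 250)        # sample 250 from this stratum
    for p in chosen[:125]:                   # half to control...
        assignments[p["id"]] = "control"
    for p in chosen[125:]:                   # ...half to intervention
        assignments[p["id"]] = "intervention"

# Each arm is guaranteed to contain exactly 125 men and 125 women.
print(sum(1 for g in assignments.values() if g == "control"))   # 250
```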

Expand full comment

I know of one randomized experiment which matched people into pairs which were as similar as possible, then for each pair chose randomly which was treatment and which was control.

Expand full comment
founding

>I chose four questions that I thought were related to authoritarianism

What you can do if you're trying to measure a single thing (authoritarianism) and have multiple proxies, is to average the proxies to get a single measure, then calculate the p-value using the average measure only. I'd recommend doing that now (even though, ideally, you'd have committed to doing that ahead of time).

Expand full comment

For the Vitamin D study, I think you are off track, as were the investigators. Trying to assess the randomness of assignment ex post is pretty unhelpful. The proper approach is to identify important confounders ahead of time and use blocking. For example, suppose that blood pressure is an important confounder. If your total n is 160, you might divide those 160 into four blocks of 40 each, based on blood pressure range. A very high block, a high block, a low block, and a very low block. Then randomly assign half within each block to get the treatment, so 20 very high blood pressure folks get the vitamin D and 20 do not. That way you know that blood pressure variation across subjects won't mess up the study. If there are other important confounders, you have to subdivide further. If there are so many important confounders that the blocks get to be too small to have any power, then you needed a much bigger sample in the first place.

Expand full comment

You would want to have a data-driven method to identify the confounders that you will use to block. Otherwise, you are embedding your intuition into the process.

Expand full comment
Comment deleted
Expand full comment

I don't disagree, but that means you do two sets of analysis. One to identify the confounders and another to test the hypothesis (unless I have misunderstood). Reminiscent of the training / validation approach used for ML, don't you think?

Expand full comment

But you do not learn about the possible confounders by doing the study. You learn about them ahead of time from other research, e.g. research that shows that blood pressure interacts strongly with COVID. Blocking can be data driven in the sense that you have data from other studies showing that blood pressure matters. If you need to do a study to learn what the confounders are, then you need to do another study after that, based on that information.

Expand full comment
founding

And, of course, the larger the sample, the less you have to use advanced statistics to detect an effect.

Another alternative: replicate your results. The first test, be sloppy, don't correct for multiple comparisons, just see what the data seems to say, and formulate your hypothesis(es). Then test that/those hypothesis(es) rigourously with the second test.

You can even break your single database into two random pieces, and use one to formulate the hypothesis(es) and the other to test it/them.

Expand full comment

Significance doesn't seem like the right test here. When you are testing for significance, you are asking a question about how likely it is that differences in the sample represent differences in the wider population (more precisely, in frequentist terms, "if I drew a sample and this variable were random, what is the chance I would see a difference at least this large?"). In this case, we don't care about that question; we care whether the actual difference between the two groups is large enough for something correlated with it to show through. At the very least, the Bonferroni adjustment doesn't apply; in fact, I would go in the other direction. The difference needs to be big enough that it has some correlation with the outcome strong enough to cause a spurious result.

Expand full comment

On section I...

The key point of randomized trials is NOT that they ensure balance of each possible covariate. The key point is that they make the combined effect of all imbalances zero in expectation, and they allow the statistician to estimate the variance of their treatment-effect estimator.

Put another way, randomization launders what would be BIAS (systematic error due to imbalance) into mere VARIANCE (random error due to imbalance). That does not make balance irrelevant, but it subtly changes why we want balance — to minimize variance, not because of worries about bias. If we have a load of imbalance after randomizing, we'll simply get a noisier treatment-effect estimate.

"If the groups are different to start with, then we won't be able to tell if the Vitamin D did anything or if it was just the pre-existing difference."

Mmmmmaybe. If the groups are different to start with, you get a noisier treatment-effect estimate, which MIGHT be so noisy that you can't reject the vitamin-D-did-nothing hypothesis. Or, if the covariates are irrelevant, they don't matter and everything turns out fine. Or, if vitamin D is JUST THAT AWESOME, the treatment effect will swamp the imbalance's net effect anyway. You can just run the statistics and see what numbers pop out at the end.

"Or to put it another way - perhaps correcting for multiple comparisons proves that nobody screwed up the randomization of this study; there wasn't malfeasance involved. But that's only of interest to the Cordoba Hospital HR department when deciding whether to fire the investigators."

No. It's of interest to us because if we decide that the randomization was defective, all bets are off; we can't trust how the study was reported and we don't know how the investigators might have (even if accidentally) put their thumb on the scale. If we instead convince ourselves that the randomization was OK and the study run as claimed, we're good to apply our usual statistical machinery for RCTs, imbalance or no.

"But this raises a bigger issue - every randomized trial will have this problem. [...] Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; [...] if you're not going to adjust these away and ignore them, don't you have to throw out every study?"

No. By not adjusting, you don't irretrievably damage your randomized trials, you just expand the standard errors of your final results.

Basically, if you really and truly think that a trial was properly randomized, all an imbalance does is bleed the trial of some of its statistical power.

See https://twitter.com/ADAlthousePhD/status/1172649236795539457 for a Twitter thread/argument with some interesting references.

Expand full comment

Regarding whether hypertension explains away the results, have not read the papers (maybe Jungreis/Kellis did something strictly better), but here's a simple calculation that sheds some light I think:

So 11 out of the 50 treated patients had hypertension. 39 don't.

And 15 out of the 26 control patients had hypertension. 11 don't.

You know that a total of 1 vitamin D patient was ICU'd. And 13 of the control patients were admitted.

There is no way to slice this such that the treatment effect disappears completely [I say this, having not done the calculation I have in mind to check it -- will post this regardless of what comes out of the calculation, in the interest of pre-registering and all that]

To check this, let's imagine that you were doing a stratified study, where you're testing the following 2 hypotheses simultaneously:

H1: Vitamin D reduces ICU rate among hypertension patients

H2: Vitamin D reduces ICU rate among non-hypertension patients.

Your statistical procedure is to

(i) conduct a Fisher exact test on the 11 [treated, hypertension] vs 15 [control, hypertension] patients

(ii) conduct a Fisher exact test on the 39 [treated, no-hypertension] vs 11 [control, no-hypertension] patients

(iii) multiply both by 2, to get the Bonferroni-corrected p-values; accept a hypothesis if its Bonferroni-corrected p-value is < 0.05

If we go through all possible splits* of the 1 ICU treated patient into the 11+39 hypertension+non-hypertension patients and the 13 ICU control patients into the 15+11 hypertension+non-hypertension patients (there are 24 total possible splits), the worst possible split for the "Vitamin D works" camp is if 1/11 hypertension & 0/39 non-hypertension treated patients were ICU, and 10/15 hypertension & 3/11 non-hypertension control patients were ICU.

Even in this case you still have a significant treatment effect (the *corrected* p-values for H1 and H2 are 0.01 and 0.02).

I don't know how kosher this is formally, but it seems like a rather straightforward & conservative way to see whether the effect still stands (and not a desperate attempt to wring significance out of a small study, hopefully), and it does seem to stand.

This should also naturally factor out the direct effect that hypertension may have on ICU admission (which seems to be a big concern).

Other kinds of uncertainty might screw things up though - https://xkcd.com/2440/ -- given the numbers, I really do think there *has* to have been some issue of this kind to explain away these results.

*simple python code for this: https://pastebin.com/wCpPaxs8
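For anyone who prefers not to open the pastebin, here is a compact scipy version of the worst-case check described above (one-sided Fisher exact test per stratum, Bonferroni-corrected by 2); the counts are the ones given in the comment:

```python
from scipy.stats import fisher_exact

def corrected_p(icu_treated, n_treated, icu_control, n_control, n_tests=2):
    """One-sided Fisher exact test for one stratum, Bonferroni-corrected for n_tests strata."""
    table = [[icu_treated, n_treated - icu_treated],
             [icu_control, n_control - icu_control]]
    _, p = fisher_exact(table, alternative="less")   # H1: treated arm has fewer ICU admissions
    return min(1.0, n_tests * p)

# Worst-case split described above:
# hypertension stratum:     1/11 treated vs 10/15 control ICU admissions
# non-hypertension stratum: 0/39 treated vs  3/11 control ICU admissions
print(corrected_p(1, 11, 10, 15))   # ~0.01
print(corrected_p(0, 39, 3, 11))    # ~0.02
```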

Expand full comment

Seems like it should be relevant that sunlight, which is a cause of vitamin D, also reduces blood pressure through nitric oxide.

https://www.heart.org/en/news/2020/02/28/could-sunshine-lower-blood-pressure-study-offers-enlightenment

Expand full comment

My suspicion is that you are reaching the limits of what is possible using statistical inference. There might be, however, alternative mathematical approaches that might provide an answer to the real question "should we prescribe X and, if so, to whom?". I refer specifically to optimisation-based robust classification / regression (see e.g. work by MIT professor Bertsimas https://www.mit.edu/~dbertsim/papers.html#MachineLearning - he's also written a book on this). But I would still worry about the sample size, it feels small to me.

Expand full comment

For the ambidexterity question, what was your expected relation between those four questions and the hidden authoritarianism variable? Did you expect them all to move together? All move separately but the more someone got "wrong" the more authoritarian they lean? Were you expecting some of them to be strongly correlated and a few of them to be weakly correlated? All that's to ask: is one strong sub-result and three weak sub-results a "success" or a "failure" of this prediction? Without a structural theory it's hard to know what to make of any correlations.

--------------

Then, just to throw one more wrench in, suppose your background doesn't just change your views on authoritarianism, but also how likely you are to be ambidextrous.

Historically, in much of Asia and Europe teachers enforced a heavy bias against left-handedness in activities like writing. [https://en.wikipedia.org/wiki/Handedness#Negative_connotations_and_discrimination] You don't see that bias exhibited as much by Jews or Arabs [anecdotal], probably because Hebrew and Arabic are written right-to-left. But does an anti-left-handed bias decrease ambidexterity (by shifting ambidextrous people to right-handed) or increase ambidexterity (by shifting left-handed people to ambidextrous)? Does learning languages with different writing directions increase ambidexterity? Is that checkable in your data?

Most of us learned to write as children [citation needed], and since most of your readership and most of the study's respondents are adults [citation needed] they may have already been exposed to this triumph of nurture over nature. It's possible that the honest responses of either population may not reflect the natural ambidexterity rate. Diving down the rabbit hole, if the populations were subject to these pressures, what does it actually tell us about the self-reported ambidextrous crowd? Are the ambidextrous kids from conservative Christian areas the kids who fell in line with an unreasonable authority figure? Who resisted falling in line? Surely that could be wrapped up in their feelings about authoritarianism.

Expand full comment

These examples are presented as being similar, but they have an important distinction.

I agree that testing p-values for the Vitamin D example doesn't make too much sense. However, if you did want to perform this kind of broad-ranging testing, I think you should be concerned with the false discovery rate rather than the overall level of the tests. Each of these tests is, in some sense, a different hypothesis, and should receive its own budget of alpha.

The second example tests a single hypothesis in multiple ways. Because it's a single hypothesis, it could make sense to control the overall size of the test at 0.025. However, because these outcomes are (presumably) highly correlated, you should use a method that adjusts for the correlation structure. Splitting the alpha equally among the four tests is unnecessarily conservative.

Expand full comment

Regarding question 2: There are various options here, the majority of of which will be most effective if you (a) know what the dependence is between your test statistics, and (b) precisely specify what you want to test.

For (a): if what you’re doing here are basic two-sample t-tests, and if the sample size of ambidextrous people is “reasonably large” then by CLT your 8 sample means (2 samples x 4 responses) are ~multivariate Gaussian with a covariance matrix you can estimate accurately based on individual-level data. Your calculations — Bayesian and frequentist — should take that into account (I can explain more if this is indeed the case).

If it’s too small for CLT to be realistic, you can still do things with permutations but the hypotheses you can test are less interesting; I assume you would rather conclude “the mean is larger in group 2” (which a t-test can give you) than conclude “the distributions in the two groups are not identical” (which is what you usually get out of a permutation test).

For (b): the two main options are testing the global null hypothesis (no differences in any of the four responses vs some difference for at least one response) or testing for differences in the four individual responses. In the first case you want to compute some combined p-value (again various options but there is no universally best way to do this and it isn’t really valid to pick one after seeing the data).

In the second case a natural choice is to compute simultaneous confidence intervals for the four mean differences, calculating the correct width using the multivariate Gaussian approximation or bootstrap. I would recommend this last option: happy to provide further details on request.

Expand full comment

"When you're testing independent hypotheses, each new test you do can only make previous tests less credible."

This doesn't make sense. If the hypotheses are independent, your view of each hypothesis should not depend on the result in the others. Your view of the effect of a new drug on cancer progression should have no relation to whether you are at the same time looking into whether gender is related to voting preference in an election. The whole idea of multiple testing correction as practiced in mainstream science seems to readily lead to absurdity and I'm not sure why this isn't more discussed (maybe it is within statistics communities I'm not part of, but mainstream scientists without statistical background seem to just accept the fundamental weirdness of the idea at face value).

How this relates to your question, people are suggesting all sorts of ways to "correct" your result but it's not clear to me why these are improvements. The data seems convincing that there is a relationship--is it more convincing if you apply some correction?

Expand full comment

"But suppose I want to be extra sure, so I try a hundred different ways to test it, and all one hundred come back p = 0.04. Common sensically, this ought to be stronger evidence than just the single experiment; I've done a hundred different tests and they all support my hypothesis."

The problem here seems to me that you're trying to apply common sense to a topic that isn't really in the realm of common sense. This is true not only in the sense that our intuition does a poor job of estimating the effects of sample variation (and therefore seeing patterns where there are none), but more importantly in that the concept of a p-value isn't easy to parse in common sense.

For example, what do you mean by saying a p-value of 0.04 supports your hypothesis? Say you're doing a one sample two-sided t-test against an hypothesized mean: how is the p-value calculated? First (forgive me if I make some minor errors here) you calculate a test statistic T = (sample mean - hypothesized mean)/(sample standard error), then the p-value is (1 - t)*2 where t is the value of the CDF of a Student's t-distribution with (sample size - 1) degrees of freedom... it's not immediately obvious that any of this will collapse down into something interpretable as "common sense".

What we can say is that - if the thing you're measuring really is normally distributed and has a true mean equal to the hypothesized mean, then across many samples the resulting p-values will be uniformly distributed between 0 and 1. If the true mean differs then the distribution of p-values will skew more and more towards low values (too bad the comments here don't allow pictures!).
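Since the comments can't show pictures, a small simulation makes the same point: under the null, one-sample t-test p-values are uniform (so about 5% fall below 0.05), while a true effect piles them up near zero.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_sims, n = 10_000, 30

def simulated_pvals(true_mean):
    """p-values from one-sample t-tests of H0: mean = 0, across many simulated samples."""
    return np.array([ttest_1samp(rng.normal(true_mean, 1, n), 0).pvalue
                     for _ in range(n_sims)])

for mu in (0.0, 0.5):
    p = simulated_pvals(mu)
    print(f"true mean {mu}: fraction of p-values below 0.05 = {np.mean(p < 0.05):.3f}")
# With true mean 0 the p-values are uniform, so roughly 5% land below 0.05;
# with a real effect the distribution is heavily skewed toward small p-values.
```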

This means that if there is no difference between the true and hypothesized mean, you'd expect to get p-values < 0.05, 5 % of the time. A p-value of 0.04 means that, if, hypothetically, the means weren't different, you'd only have seen as much or more difference than you actually saw, 4 % of the time. Saying that this supports your hypothesis is dangerous leap over what's really a non-intuitive idea. The best you can say is that the effect is unlikely to be due to chance. If you walk through life declaring that p = 0.04 supported your hypothesis you'd be wrong 4 % of the time. Every new test you do *does* (in a sense) make the previous test less credible because the more tests there are the more likely at least one thing is plausibly just random noise.

If you performed 100 studies and they all got p ~ 0.04 I would be astounded and confused but only convinced that something extremely bizarre is going on. What kind of process would even produce such a result with other than astronomically unlikely odds? If your studies' data were genuinely independent you should have got a scatter of p-values - between 0 and 1 if there were no difference, and skewed towards 0 if there was. Inventing this kind of scenario seems a perfect example of how "common sense" doesn't work here.

Personally, I think there's nothing a p-value can tell you that a confidence interval can't tell you 100x better, and I die a little inside whenever I see one.

Expand full comment

"Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; if your threshold for "significant" is p < 0.05, it'll be after investigating around 20 possible confounders...So if you're not going to adjust these away and ignore them, don't you have to throw out every study? I don't think there's a formal statistical answer for this."

This is a well-known problem, going back to Fisher, for which a lot of methods have been developed. See here for a fairly in-depth (in parts technical) treatment of them co-authored by one of the top statisticians of the last 50 years working in this field: https://arxiv.org/pdf/1207.5625.pdf .

Expand full comment

Regarding point 2, when you have multiple tests assessing the same hypothesis: why not use the average p-value? If the null hypothesis is true, you expect the average p-value to come out equal to or lower than the observed average (observed average * 100) percent of the time.

Expand full comment

However, this method is conservative towards the null hypothesis; in Scott's case it gives a p-value of 0.147. Another approach could be: if the null hypothesis is true and we run N experiments each yielding a p-value, what is the probability that we obtain a minimum p-value equal to or lower than the observed minimum p-value? It would be 1-(1-observed minimum p-value)^N. In Scott's case, 0.0316.
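As a sketch of that second calculation (the smallest observed p-value of roughly 0.008 is backed out from the 0.0316 quoted above, so treat it as illustrative):

```python
def min_p_correction(p_min, n_tests):
    """P(smallest of n_tests null p-values <= p_min) = 1 - (1 - p_min)**n_tests."""
    return 1 - (1 - p_min) ** n_tests

print(min_p_correction(0.008, 4))  # ~0.0316, matching the number above
```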

Expand full comment

Sloppy thinking! The sum of p-values is described by the Irwin–Hall distribution. My suggestion is rubbish, please ignore.

Expand full comment

> For example, it seems like opposition to immigration really was strongly correlated with ambidexerity, but libertarianism wasn’t. So my theory that both of these were equally good ways of measuring some latent construct “authoritarianism” was wrong. Now what?

Did you check whether opposition to immigration was correlated with libertarianism in the way you expected? As a Trump-supporting libertarian, and an anti-immigration anti-authoritarian, I must admit I felt a bit frustrated with your identification of the "authoritarianism" latent variable.

In terms of "now what", maybe it's time to go back to the drawing board and think about what "authoritarianism" actually means. Is it even a real thing, or is it just a label that people apply to things that they don't like?

Expand full comment

I wrote a paper on why using inferential statistical tests for this purpose is wrong:

https://www.sciencedirect.com/science/article/abs/pii/S0093934X16300323

To quote our abstract:

Experimental research on behavior and cognition frequently rests on stimulus or subject selection where not all characteristics can be fully controlled, even when attempting strict matching. For example, when contrasting patients to controls, variables such as intelligence or socioeconomic status are often correlated with patient status. [...] One procedure very commonly employed to control for such nuisance effects is conducting inferential tests on [...] subject characteristics. [...] Such a test has high error rates and is conceptually misguided. It reflects a common misunderstanding of statistical tests: interpreting significance not to refer to inference about a particular population parameter, but about 1. the sample in question, 2. the practical relevance of a sample difference (so that a nonsignificant test is taken to indicate evidence for the absence of relevant differences). We show inferential testing for assessing nuisance effects to be inappropriate both pragmatically and philosophically, present a survey showing its high prevalence, and briefly discuss an alternative in the form of regression including nuisance variables.

Expand full comment

Those aren't Bayes factors. The Bayes factor in favor of the Ambidextrous Authority hypothesis (AAH) would be P(you got that data|AAH) / P(you got that data|~AAH) = P(you got that data|AAH) / p-value.

You haven't figured out what you think is the probability of getting those results if your non-null hypothesis is true. That's tricky, because it means you have to nail down a concrete non-null hypothesis. That's *especially* tricky because you're looking at four different outcomes that won't necessarily fit nicely under a single model.

How much do you want this number?

Expand full comment

My wife and I were reading Middlemarch, a study of provincial life, when a friend sent us this link. https://youtu.be/KYWD45FN5zA

Did an obscure TV show predict Red China's plans against the West?

Expand full comment

(1) While "Doctor Who" may not be known everywhere, it's hardly an obscure TV show (2) Are we getting political commentary from Youtube links now? (3) The answer to your question is "no" (4) I congratulate you on either your grasp of multifarious languages or your skill in using Google Translate

Expand full comment

"Are we getting political commentary from Youtube links now?"

By definition, links link us to information, which may be presented in form of videos, which may be made available through Youtube.

"The answer to your question is 'no'."

Says you, yet the resemblance is too uncanny for words.

Expand full comment

Hot new threat for 2021 - Davros and the Daleks. You read it here first, folks!

Expand full comment

Not the fictional characters. I am talking about the modus operandi depicted. Isn't it possible it was inspired by Red China's plans? I was told it was a British TV show, and that Mao Zedong was alive in China and Labour was in charge in the UK back then.

Expand full comment

Ok, Scott, I think this is a solved problem (under certain assumptions, that most of the multiple comparison methods like Bonferroni correction use), and furthermore, your intuition about "so I try a hundred different ways to test it, and all one hundred come back p = 0.04... this should be stronger evidence" is correct.

The classic version of the multiple comparison adjustment says "we have done K tests; given K, how likely was it that our best p-value would be smaller than some critical value? And specifically, how low a critical value alpha_adj do we need to use such that the chance that the best p-value is below alpha_adj is less than alpha?" And then to get an overall "family" type I error rate of alpha, we use the lower critical p-value for the individual test.

This is fine, but it only answers what it answers. It says, to be clear, how likely is it that our *best* p-value would be at least this small. But that throws away information on all the other p-values below the best one.

The situation you describe, with the four different experiments, testing the same hypothesis a few different ways, essentially says "hey, don't we have extra information? And doesn't it seem rather unlikely that most of the individual tests we conduct give results that, under the null, should be pretty unlikely to happen (strictly speaking, pretty unlikely that something at least this extreme happens)?" (Even if none of the p-values is extremely small, it *should* be pretty unlikely for independent tests to keep finding p = 0.05).

We can make this clear, and show the solution, with a simple example. Imagine a (possibly biased) coin, that we flip 10000 times. But we split the flips into 100 sets of 100 flips, and count the number of heads for each.

We're interested in testing the hypothesis that the coin is biased (towards heads), compared to a null of the coin being fair (Prob(Heads) = 0.5). Note that by construction, in this specific case, *either* the null is always correct, or it is always false. The coin is biased, or it isn't. You have 4 tests of the same underlying hypothesis; the hypothesis is either false, or it isn't. (Note the contrast with testing whether 100 different medications cure an illness - anywhere between 0 and 100 of them might be false. This complicates things, but I'll leave that aside for now).

We always (probably always?) construct statistical tests by comparing against the null of no effect / no difference etc. Under the null that the coin is not biased, each of our 100 experiments (of 100 coin flips each) is independent. You get 50 heads out of 100 in expectation, and experiment 73 yielding 57 heads tells you *nothing* about the distribution of outcomes for experiment 87. And so on. For each experiment you have a realisation, and a p-value associated with it (for simplicity, the one-sided p-value - our hypothesis is bias in favor of heads). Sort/rank the p-values.

The typical multiple comparisons test says "under the null of no true effect, how likely is it that the best p-value that arose from doing this procedure K (i.e. 100) times would be smaller than our best (smallest) p-value?" This can be calculated using the standard corrections (Bonferroni is a linear approximation of the correct calculation, which is Sidak's method mentioned in the linked piece).

But we can also say "ok, but how likely is it that if we did this procedure K times, the Nth smallest p-value would be smaller than our Nth smallest p-value?" You can do this separately for each value of N (from 1 to K). In each case, it is a fairly standard binomial probability calculation, which will look really ugly to write without LATEX. In your example, if you have 4 experiments, it is pretty unlikely that the second worst would still have a p-value of ~0.05. So this method picks up on the point you intuitively arrive at - if you do a randomised experiment many times, it should be pretty unlikely to get borderline significant results every time. (Note that in my stylised example, we can just combine the experiments into one jumbo experiment of 10000 coin flips, and then all those p = 0.04 will become a single p = 0.000000000001, exactly matching your intuition that these low p-values should "add up" in evidentiary terms).
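For anyone who wants the "ugly without LaTeX" calculation in runnable form, here is a minimal sketch (scipy assumed): under the null the K p-values are i.i.d. uniform, so the chance that the Nth smallest lands at or below some value x is just a binomial tail probability.

```python
from scipy.stats import binom

def prob_nth_smallest_below(n, k, x):
    """Under the null (k independent uniform p-values), P(nth smallest p-value <= x)
    equals P(Binomial(k, x) >= n)."""
    return binom.sf(n - 1, k, x)

# e.g. the chance that even the 2nd smallest of 4 null p-values is below 0.05 is ~0.014
print(prob_nth_smallest_below(2, 4, 0.05))
```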

Another commenter mentions Holm-Bonferroni (which you then comment on). This is not the same. It (I think) ends up being a linear approximation to the correct correction in the "100 different tests for 100 different medications curing an illness" case (with a weird caveat I'll skip). Another commenter mentions Benjamini-Hochberg. I think it is basically a linearised version of the binomial probability calculation I mention above - it is close to identical for small N (relative to K, i.e. is the second best p-value demonstrative of a "real" effect?) but (I think) is way too conservative for N close to K. (I want to avoid writing 500 words explaining the distinction here that no-one will read... so leaving it there.)

So basically, you're correct, and it can be calculated properly.

CAVEAT: Independence - note that in my example, each of my 100 experiments was independent. In your example, you have 4 tests of a hypothesis. If those tests are independent - essentially, the information they contain doesn't overlap in any way, this method works. If the tests aren't independent, all of the most standard multiple testing corrections methods are in trouble. (Because calculating joint probabilities is hard for non-independent events, you can't just multiply p(A) with p(B)).

Thinking about independence is tricky, because "if the hypothesis is true", of course all 4 tests are likely to give low p-values, and if the hypothesis is not, they are not. Isn't this a violation of independence? No. Why not? Because we construct p-values under the null of no true effect, i.e. that the hypothesis is false. So what we mean by independence is "assuming there is no true effect, would finding a low p-value on test #1 change the probability distribution of p-values (e.g. make a low p-value more likely) for test #2-4?" The coin flip example is a good example of independence. As for a simple example of non-independence, consider the following: we want to know if wearing a headband makes athletes run faster. So we turn up at the local track meet, randomly give some kids headbands, and time them in the 100m sprint. (And get some p-value). Then we *don't* reallocate the headbands, and time them in the 200m sprint. (And get some p-value). Even if the headband has no effect, if we randomly happened to give the headbands to kids who are faster (or slower) than average already, we'll get similar results (and thus p-values) in each case because people who are fast at 100m are also fast at 200m. The experiments reuse information.

Expand full comment

Many are making similar comments below, but the key takeaway is that naive randomization is dumb because we can do stratified sampling instead. Naive randomization is an artifact of the days before fast computers when a good stratification would have been hard.

Stratification is like if you did naive randomization a million times and then took the randomization instance with the best class balance and ran with it, which despite Scott's reservations is better than taking a crappy randomization that guarantees problems with interpretation.

But even with this crappy randomization they can still do propensity score matching and deal with it, up to a point.

Expand full comment

False Discovery Rate (FDR) correction.(Benjamini and Hochberg) is now much more common than Bonferroni because, yeah, Bonferroni is way too conservative. FDR effectively makes your lowest p value N times higher (N is number of tests), the next best one N/2 times higher, the next one N/3 times higher and so on. In contrast Bonferroni makes them all N times higher.

Also for both FDR and Bonferroni there are modifications for correlated tests which make the multiples smaller, so you don't destroy your significance just by doing the same test on a bunch of copies of the same data set or other correlated variables from one data set, so the "100 different ways to test it" concern isn't really a concern since you won't be multiplying by 100, even for Bonferroni.

Bonferroni without correction for correlations is really only recommended for those with p values so small they can afford to just multiply by a large number for no reason, still have significance, and thus dispense with their critics. It's the ultimate power move, pun intended.

Expand full comment

+1 on this. FDR is an easy to apply (just shove the p-values into an R function) solution to this problem.
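For anyone curious what that function is doing under the hood, here is a minimal sketch of the Benjamini-Hochberg adjustment in Python (the p-values below are hypothetical placeholders, not Scott's actual numbers):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values: the kth smallest is scaled by n/k."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity, largest rank down
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

print(bh_adjust([0.008, 0.049, 0.052, 0.48]))  # ~[0.032, 0.069, 0.069, 0.48]
```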

Expand full comment

This kind of thing is super relevant when you're analyzing the results of an A/B test in a game, website, or app: you may have one target variable (revenue/retention/crash rate/etc), but in addition it's *really* crucial to make sure that your changes aren't having negative effects that you didn't expect. You can end up looking at 30 different metrics that all measure engagement for different types of users and tell different stories, and the (extremely tough and almost never well done) job of an analyst is to figure out whether the test improved anything or not.

Then the fuckin PM who specced the feature comes in, looks through the 50 metrics associated with a test, picks the one with the biggest change and plugs it into an online significance calculator to declare victory and get a promotion. C'est la vie.

This stuff is way more fiddly and subjective in the real world than stats classes make it seem. A good analyst can make the data tell almost any story they want, and can do so without making it clear that they cheated, and sometimes without even realizing it themselves. A great analyst knows how bad results smell and how to avoid deluding themselves, and proves it by delivering business results. Unfortunately it's really hard to tell the difference until you've worked with someone for some time.

Expand full comment

If you rerandomized until none of your comparisons had p <0.05 for any difference in any of 100 possibly-relevant measurements, you would in some sense have an incredibly weird and unrepresentative instance of dividing the group into control and treatment groups. Maybe it would be okay, maybe it wouldn't, but I would certainly no longer be comfortable considering it a randomized study.

Expand full comment

> Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before.

Rerandomization is definitely a thing that gets discussed. See, for example, https://healthpolicy.usc.edu/evidence-base/rerandomization-what-is-it-and-why-should-you-use-it-for-random-assignment/

Expand full comment

Judea Pearl has written a book about using Bayesian statistics while including causality graphs. I still have it on my list to read again because much of it went over my head. However, it seems very applicable to these problems. I recall he claimed that you could test causal hypotheses much more efficiently than with randomized trials. https://en.wikipedia.org/wiki/The_Book_of_Why

Expand full comment

> Multiply it all out and you end up with odds of 1900:1 in favor

That's not how that works. You were complaining that Bonferroni adjustments were too conservative, because they under-account for correlation, but now you did the opposite, where you assumed they were independent evidence.

Of course, understanding the correlation of multiple streams of evidence is HARD. And that's a large part of why people really like the 1-study, 1-hypothesis setup. It's the same kind of curse of dimensionality we see making lots of other things hard.

Informally, figuring out the correlation matrix of N variables requires far more data than figuring out pairwise correlations - so you're probably not actually better off with 10 studies looking at impact of 1 variable each than you are with 1 study ten times as large looking at all of them, but the size of the confidence intervals, and the costs of your statistical consultants figuring out what your data means, will be far higher in the second case. Keeping it simple fixes that - and punitive Bonferroni corrections are conservative, but for exactly that reason they keep people very, very honest with not overstating their findings.

Expand full comment

Maybe you could use the „Structural Equation Modeling“ / „Confirmatory Factor Analysis“ framework for your analysis. You are using the four indicators because you think they're influenced by one common factor „authoritarianism“. Naturally the four indicators are not 100% predetermined by „authoritarianism“ and there is some error. The SEM approach allows you to extract the common factor with error correction. (I think of this approach as some kind of optimization problem: find the value for each datapoint which best predicts the four measured indicators.)

After that you could compute the correlation between your manifest measure of ambidexterity and the latent measure of authoritarianism (with only one comparison, and with the measurement error in your four indicators corrected).

Expand full comment

>Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it? But it seems harsh to force them to perform a study even though we know we'll dismiss the results as soon as we get them. If we made them check a pre-written list of confounders and re-roll until there were no significant differences on any of them, what could go wrong? I don't have a good answer to this question, but thinking about it still creeps me out.

See https://en.wikipedia.org/wiki/Stratified_sampling

Stratify your population into subpopulations based on every variable which could conceivably correlate with ease of COVID recovery, then randomly split each subpopulation into a treat/do not treat group.
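A minimal sketch of that stratified split, assuming you have the whole sample up front (the names here are placeholders):

```python
import random
from collections import defaultdict

def stratified_assign(subjects, stratum_of, seed=42):
    """Randomly split each stratum in half into treatment and control."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in subjects:
        strata[stratum_of(s)].append(s)
    treatment, control = [], []
    for members in strata.values():
        rng.shuffle(members)
        half = len(members) // 2
        treatment.extend(members[:half])
        control.extend(members[half:])
    return treatment, control

# e.g. stratify on sex and hypertension status:
# treatment, control = stratified_assign(patients, lambda p: (p["sex"], p["hypertensive"]))
```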

Another strategy I've seen is to pair every study participant with some other participant who looks very much like them in every way which seems to matter. Randomly give Vitamin D to one but not the other.

I guess these strategies might be a little impractical if you don't have access to the entire population at the start of the study (e.g. if a few new people are entering/exiting the ER for COVID every day, and you want to run the study over several months).

Expand full comment

The important thing is to avoid creating additional researcher degrees of freedom. Any method is fine, as long as it's implemented in a randomizer function in R and the researcher only runs that function once (instead of fiddling somehow until the randomization comes out the way they want to see it).

Expand full comment

> But with very naive multiple hypothesis testing, I have to divide my significance threshold by one hundred - to p = 0.0005 - and now all hundred experiments fail. By replicating a true result, I've made it into a false one!

This is really not the right interpretation of p-values.

"I try a hundred different ways to test it, and all one hundred come back p = 0.04." is not a thing. You would not get the same value on every test; you'd get a variety of answers, some that pass the 0.05 threshold and some that fail it. For a simple sample of a normal distribution the mean you measure will vary proportional to σ/sqrt(n). If your sample size isn't huge, a significant number of your measurements could be a whole standard deviation or more closer to the null hypothesis value -- definitely not a passing p-value at all.

In your case, getting 1 significant, 1 very not significant, 2 close (which imo is the better read of the situation) is clearly very possible, like, a distribution around a mean which falls either just above p=0.05 or just below. Clearly it's not very valid to just say "well two are significant therefore the result is significant" -- your very-failing test _disproves_ the same thing your passing tests prove!

(That said, maybe the failing question was just not a good choice of signal. Nothing can be done in the statistics about that.)

Expand full comment

>In the comments, Ashley Yakeley asked whether I tested for multiple comparisons; Ian Crandell agreed, saying that I should divide my significance threshold by four, since I did four tests. If we start with the traditional significance threshold of 0.05, that would mean a new threshold of 0.0125, which result (2) barely squeaks past and everything else fails.

No, you did eight tests since it was two-tailed. The correct threshold is 0.00625. "Ambidextrous people more libertarian" and "ambidextrous people less libertarian" are two separate comparisons and you should correct for that. You should also pre-register (at least in your head) whether you're doing a one or two tailed test.

It's worth noting that you are allowed to divide the comparisons up unevenly when doing Bonferroni correction. Consider an experiment with two comparisons. You can say that comparison A has to be P<0.01 or comparison B has to be P<0.04, rather than setting both thresholds at 0.025.

IANAS but my intuition is that this sort of statistics is the researcher saying "The null hypothesis is false and I can prove it by specifying ahead of time where the results will end up." They can mark out any area of the possible result-space so long as the null hypothesis doesn't give it more than a 1/20 chance of the results landing in that area.

Expand full comment

If your outcomes are correlated, use the Westfall-Young correction instead of Bonferroni. Bonferroni is generally regarded as overly conservative

https://ideas.repec.org/c/boc/bocode/s458440.html

Expand full comment

Now even more confusion on the Vitamin D front! News story today saying that there is now a report by a group of politicians in my country recommending people take Vitamin D supplements: https://www.rte.ie/news/2021/0407/1208274-vitamin-d-covid-19/

"The 28-page report, published this morning, was drawn-up by the cross-party Oireachtas Committee on Health in recent weeks as part of its ongoing review of the Covid-19 situation in Ireland.

It is based on the views of the Covit-D Consortium of doctors from Trinity College, St James's Hospital, the Royal College of Surgeons in Ireland, and Connolly Hospital Blanchardstown, who met the committee on 23 February.

The Department of Health and the National Public Health Emergency Team have previously cautioned that there is insufficient evidence to prove that Vitamin D offers protection against Covid-19.

The report says while Vitamin D is in no way a cure for Covid-19, there is increasing international evidence from Finland, France and Spain that high levels of the vitamin in people can help reduce the impact of Covid-19 infections and other illnesses."

So should you or shouldn't you take vitamin D? At this stage I honestly have no idea. Presumably you should if you're deficient, but too much will lead to calcium deposits on the walls of your blood vessels, according to a Cleveland cardiologist https://health.clevelandclinic.org/high-blood-pressure-dont-take-vitamin-d-for-it-video/. So how much is *too* much?

The report in question: https://www.rte.ie/documents/news/2021/04/2021-04-07-report-on-addressing-vitamin-d-deficiency-as-a-public-health-measure-in-ireland-en.pdf

Expand full comment

Even before the pandemic, the official NHS recommendation was to take vitamin D from September to March. With the pandemic meaning that people spend less time outside, the question is whether it's worth taking vitamin D supplements even if they don't help with Covid itself.

https://www.nhs.uk/conditions/vitamins-and-minerals/vitamin-d/

Expand full comment
founding

>Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it?

Better if they consider this *before* randomizing, and define quantitative criteria for what constitutes an acceptable randomization. If I roll 1d100 and, post facto, throw out results that "look too big" or "look too small", I've got a random distribution of middle-sized two-digit numbers biased by my intuitive assessment of "bigness" and "smallness". If I decide a priori to roll 1d100 and reject values below 15 or over 95, that's a truly random distribution over the range of 15-95.

With a small sample and many relevant confounding variables, it's likely that any single attempt at randomization is going to be biased one way or another on one of them. Fortunately, we've got computers and good random number generators; if you can define your "acceptable randomization" criteria mathematically, you can then reroll the dice a million times until you get something that fits those criteria - and without introducing biased human judgement into the assessment of any proposed randomization.
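A minimal sketch of that loop, assuming a pandas table of baseline covariates and a pre-registered balance criterion (the names and the 0.2 threshold are illustrative, not a standard):

```python
import numpy as np
import pandas as pd

def rerandomize(df, covariates, max_std_diff=0.2, max_tries=1_000_000, seed=0):
    """Reroll a 50/50 assignment until every pre-specified covariate has a
    standardized mean difference below max_std_diff (criteria fixed in advance)."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        treat_mask = rng.permutation(len(df)) < len(df) // 2
        treat, control = df[treat_mask], df[~treat_mask]
        if all(abs(treat[c].mean() - control[c].mean()) / df[c].std() < max_std_diff
               for c in covariates):
            return treat_mask
    raise RuntimeError("no acceptable randomization found")

# hypothetical usage with pre-registered covariates:
# mask = rerandomize(patients, ["age", "systolic_bp", "baseline_vitamin_d"])
```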

You've still got the potential for bias in the definition of your acceptability criteria, but that's an extension of the same bias you unavoidably introduced when you decided to look at e.g. blood pressure but not eye color. I don't think that becomes fundamentally worse if you add simple criteria like "for the variables we are looking at, an acceptable randomization must not deviate from the mean by more than one standard deviation".

Expand full comment

I think the “sensitivity analysis” where they adjust for blood pressure is adequate. It’s not an observational study, so we know that the differences in blood pressure aren’t caused by some third factor other than randomization. Also, you don’t have to worry too much about multiple testing corrections, because both results are significant, which makes it less likely the null hypothesis is true; agreement by chance is only an issue under null hypothesis assumptions.

Expand full comment

For your Bayesian example you could solve it with a hierarchical model but that’s very technically challenging.

Expand full comment

For the first one: reshuffling until you get well-balanced treatment and control is legit. You're not looking at the effect size, so you don't introduce multiple hypothesis testing issues. This is the basis for genetic matching https://ideas.repec.org/a/tpr/restat/v95y2013i3p932-945.html More generally, the solution to this problem is matching https://www.wikiwand.com/en/Matching_(statistics) It's not uncontroversial (as usual, Judea Pearl has raised concerns), but people use it.

Expand full comment

Gary King, Don Rubin, Jasjeet Sekhon, Alexis Diamond are the big names here if you want to look at their work.

Expand full comment

It feels like an important element missing in this discussion is that the problem of multiple testing arises from the fact that tests may be correlated. In the extreme, for example, if all libertarians are pro-immigration and all non-libertarians are against immigration, then testing the relation between ambidexterity and libertarianism will give you no additional information if you've already tested for the relation between ambidexterity and pro/anti-immigration positions.

Of course, taking into account these correlations might be very tricky, so the easiest fix I can think of is to use Monte Carlo to build confidence intervals on some pre-determined set of variables, and use those to test whether your particular randomization passes the "statistically random" test.

Expand full comment

Regarding "Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; if your threshold for "significant" is p < 0.05, it'll be after investigating around 20 possible confounders (pretty close to the 15 these people actually investigated)."

Once the number of confounders is 14, there's a better than 50% chance that at least one of them will be a false positive (since 0.95 ^ 14 = 0.488 is the chance that none of them is).

Expand full comment

For anyone less math-inclined who wants to find this number for different circumstances, the formula is ceil(log(overall max probability of no failures)/log(individual success probability)), where ceil means "round up."

So if you want to calculate how many 99% probability events you would need for a ≥10% chance of at least one failing (i.e. ≤90% chance that every event succeeds), the calculation is ceil(log(.90)/log(.99))=ceil(10.48)=11.
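The same arithmetic as a tiny Python helper, just to mirror the worked example:

```python
import math

def tests_needed(max_prob_all_pass, per_test_success):
    """Smallest number of tests at which P(all pass) drops to max_prob_all_pass or below."""
    return math.ceil(math.log(max_prob_all_pass) / math.log(per_test_success))

print(tests_needed(0.90, 0.99))  # 11, matching the example above
```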

Expand full comment

I think people on this thread might enjoy the paper 'Multiple Studies and Evidential Defeat' by Matt Kotzen: http://dx.doi.org/10.1111/j.1468-0068.2010.00824.x or https://sci-hub.do/10.1111/j.1468-0068.2010.00824.x. Quoting the first three paragraphs:

'You read a study in a reputable medical journal in which the researchers re- port a statistically significant correlation between peanut butter consumption and low cholesterol. Under ordinary circumstances, most of us would take this study to be at least some evidence that there is a real causal connection of some sort (hereafter: a real connection) between peanut butter consumption and low cholesterol.

Later, you discover that this particular study isn’t the only study that the researchers conducted investigating a possible real connection between peanut butter consumption and some health characteristic. In fact, their research is being funded by the Peanut Growers of America, whose explicit goal is to find a statistical correlation between peanut butter consumption and some beneficial health characteristic or other that they can highlight in their advertising. Though the researchers would never falsify data or do anything straightforwardly scientifically unethical,1 they have conducted one thousand studies attempting to find statistically significant correlations between peanut butter consumption and one thousand different beneficial health characteristics including low incidence of heart disease, stroke, skin cancer, broken bones, acne, gingivitis, chronic fatigue syndrome, carpal tunnel syndrome, poor self-esteem, and so on.

When you find out about the existence of these other studies, should that at least partially undermine your confidence that there is a real connection between peanut butter consumption and low cholesterol?'

The paper basically argues on Bayesian grounds that the answer is 'no'. The takeaway for me is that the whole 'correcting for multiple hypotheses' thing is just some crude system that journal editors need to cook up in order to keep from being swamped by uninteresting papers, and not something that people who are just interested in finding out the truth about some question need to pay any attention to.

Expand full comment

For a real world example:

https://io9.gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800

Author purposefully set up a study checking a bunch of different things, got a significant p-value between chocolate consumption and weight loss, and got published and heavily repeated in media as an experiment.

Expand full comment

My answer to this - null hypothesis statistical significance tests are bad and should be abandoned - see this post from Andrew Gelman:

https://statmodeling.stat.columbia.edu/2017/09/26/abandon-statistical-significance/

It's better to treat p-values almost like effect sizes, in that you develop an intuitive understanding of what different p-values mean. Then you apply Bayesian reasoning on whether the effect is plausible. My own personal intuition is something like this, when deciding whether to include effects in a multiple regression model:

p > 0.25 - Very little evidence. Include effect in model if there is very strong prior evidence. (very strong evidence might be stuff like "houses with more square footage sell for higher prices." The evidence should be nearly unassailable)

0.1 < p < 0.25 - Extremely weak evidence from data. Include effect in model if there is strong prior evidence. (I'd maybe say strong evidence would be something like "SSRIs have some benefit for anxiety and depression." There is maybe a small minority of evidence suggesting that SSRIs don't work, but overall there is a decently strong consensus that they do something).

0.05 < p < 0.1 - Weak evidence from data. Include effect in model if there is moderate prior evidence. (This should still be relatively settled science. Maybe the "SSRIs work" question would have had this level of evidence in 1995, when some data was available, but before we had tons of really good trials with lots of different SSRIs).

0.01 < p < 0.05 - Okay evidence from data. Should be some prior evidence, maybe a good causal explanation. (Maybe this would be the "SSRIs work" question in 1987 right after Prozac was approved. Some high quality data exists, but there aren't multiple studies looking at the problem from different angles).

0.001 < p < 0.01 - Moderate evidence from data. May sometimes accept result even in the absence of any good priors. I'd say your authoritarianism vs. ambidexterity result falls in this category, personally. The prior data isn't terribly convincing to me, so I'd want to see p-values in this range for me to consider this as a potentially real effect.

p < 0.001 - Strong evidence from data. Consider including effect even if weak/moderate prior evidence points in the opposite direction.

Expand full comment

I think there are a couple of things going on. First, you're often conflating effect size and significance. Just because something is significantly different between the groups does not mean it's a plausible alternative explanation (though it could be). It sounds like some other commenters have mentioned this. Also, if it is different, you could at least potentially do some sort of mediation analysis.

Second, you're not really thinking statistically. There's no certainty in any analysis; we're just comparing data to models and seeing how consistent or inconsistent they are. That gives us information but not conclusions, if that makes sense. More specifically, "correcting for multiple comparisons" is just ensuring that your false positive rate remains approximately 5% under certain assumptions. You can interpret p-values without that, and you still have to interpret them with that.

There might also be something similar to the gambler's fallacy going on here. Each individual p-value is independent (sort of), and you should expect about a 5% false positive rate, but if you anticipate doing 100 tests, you should anticipate a higher chance of getting *at least one* false positive.

A good case study of the statistical thinking issue is in the example you give. If you do one hundred tests and all of them come back p = 0.04, something has gone horribly, horribly wrong. The chance of all of them being just under the threshold is incredibly unlikely. In fact, many p-values just below .05 is often good evidence of p-hacking. I would interpret 100 cases of p = 0.04 as some sort of coding or computer error, or the strangest p-hacking I've ever seen. I know that's not the point of the thought experiment, but I think how you chose to construct the thought experiment is revealing of what you're misunderstanding here.

Expand full comment

The core of science is not significance testing but repeatability and replicability.

As you intuit, multicomparison testing trades false positives for false negatives. This article and, more importantly, downloadable spreadsheet from Nature can help you explore that (https://www.nature.com/articles/nmeth.2900#MOESM343). Whether that’s “good” depends on what you’re trying to do. Want to declare something to be the ABSOLUTE TRUTH? Then suppress the false positives! Trying to figure out where to dive in deeper for your next study? Then do no corrections and explore the top statistical features in an intentional follow-up study. Intentionality is how we test for causality, which is what we’re really after in science.

I also highly advise understanding type M and type S errors, because overaggressive multicomparison testing leads to these at the meta-study level causing systemic replication issues. Andrew Gelman discusses them at various points in his books and on his great blog. Gelman is a Bayesian, but this point isn’t inherently Bayesian, just more intellectually obvious to someone working in a Bayesian paradigm

Expand full comment

For the four measures of authoritarianism problem, I'd combine them into one. You could average them or do a weighted average in whatever way made sense to you as long as you did so before seeing how each measure correlated individually.

Expand full comment

Sorry, I should clarify: I mean combine the raw data, not combine p-values. Basically you have the hypothesis that authoritarianism correlates with ambidexterity or whatever so you first get a single measure of a person's authoritarianism and of a person's ambidexterity and then look at the correlation of just those.
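A minimal sketch of what that looks like with the raw data in hand (array names are placeholders):

```python
import numpy as np
from scipy import stats

def composite_test(auth_items, ambidextrous):
    """Z-score each authoritarianism-flavored item, average into one composite,
    then run a single pre-registered correlation against ambidexterity."""
    z = stats.zscore(np.asarray(auth_items, dtype=float), axis=0)
    composite = z.mean(axis=1)
    return stats.pearsonr(composite, ambidextrous)  # one comparison, no correction needed

# auth_items: shape (n_respondents, 4); ambidextrous: 0/1 array of length n_respondents
```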

Expand full comment

Since your tests are likely highly positively correlated, you might want to use a resampling method, like the Westfall & Young minP test. But it can't be done from the summary stats in the post: you need the raw data and a bit of computing time. I see others here also mention this: it should be better known.

There are many surprising ways to get valid multiple-hypothesis rejection procedures that most people don't know about, but the key is you have to pick one ahead of time. Everyone uses Bonferroni so using Bonferroni doesn't raise any questions, but if you used "Scott's ad-hoc procedure" then everyone wonders whether it was cherry-picked. (A meta-multiple hypothesis testing correction might be needed?) Holm-Bonferroni is also a good test because it's just strictly better than Bonferroni, so really we should just use that as our default and it's not hard to defend. But in general, they all have weird edge-cases where you will be disappointed and surprised that it says "non-significant": that's inevitable.

But here's a simple alternative procedure. If you can order your hypotheses ahead of time (say, by your a priori expectation of how important they are), then reject starting at the top of the list all of those with p < 0.05 and stopping at the first one that's p>0.05. I.e. there's no Bonferroni-esque correction factor at all if you already have an order. If you have a good ordering of the importance, this might be the best you can do, but you better pre-commit to this since it's easily game-able.

Also, a pet peeve: Bonferroni doesn't assume independence. Independence is a very particular and restrictive assumption which rarely holds in the cases where you want Bonferroni. Bonferroni works *always*. The hardest case is negatively-correlated p-values, and Bonferroni still works for that. If your results were truly independent, then you would use Fisher's combined p-value, not Bonferroni correction, and get a lot better power from it. There are corrections for Fisher's that supposedly take into account dependence (Brown's method) and that might be applicable here, but I don't entirely trust them (out of ignorance on my part of how they work).
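For what it's worth, if the independence assumption really did hold, the Fisher combination is a one-liner in scipy (the p-values below are hypothetical placeholders):

```python
from scipy import stats

pvals = [0.049, 0.008, 0.052, 0.48]  # hypothetical p-values from four independent tests

stat, p_combined = stats.combine_pvalues(pvals, method="fisher")
print(p_combined)  # a single combined p-value; only valid if the tests really are independent
```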

Expand full comment

"I think the problem is that these corrections are for independent hypotheses, and I'm talking about testing the same hypothesis multiple ways (where each way adds some noise)." — No. The problem is your hypothesis/data ratio.

Given a fixed set of data, there are many more concepts that can be used to divide up the data than there is data itself. If you continually test new hypotheses without adding more data, then you will eventually stumble across a concept that happens to divide the data into "statistically significant" groups. But that doesn't mean the concept isn't an artificial distinction, like (got COVID)+(has golden retriever)! In fact, it very likely is.

The challenge here is that a hypothesis isn't just an English-language hypothesis. Every new mathematical specification of that English-language summary is effectively a new hypothesis, because it treats phenomena you _assume_ to be irrelevant differently. (See, e.g. Tal Yarkoni's "The Generalizability Crisis" (https://psyarxiv.com/jqw35).) This means that testing multiple mathematical specs of the same conceptual idea on the same data is almost certainly going to see a correlation arising from an artificial conglomeration of irrelevant factors, just by random chance.

To avoid this, each time you add a hypothesis, you need to either add more data or increase your skepticism. A Bonferroni-type correction is the latter option.

Expand full comment
founding

This is, indeed, a more difficult problem than most scientists are willing to admit.

For what it's worth, in my field (particle physics), we take a different approach:

We still talk about p-values and report experimental deviations from theory in terms of 𝜎-confidence levels (see recent "4.2𝜎 anomaly in muon g-2"). However, nobody has any idea how to quantify or correct for testing multiple hypotheses.

An example: in searches at the Large Hadron Collider, we search for particles of mass X and correct the statistical significance of some deviation in bin X due to the "look elsewhere effect" (LEE): what is the probability we would have seen *some* deviation at *some* mass bin? But we don't just measure mass/energy; in particle collisions we measure thousands of properties of (sometimes) dozens of outgoing particles. And these properties are basically all associated with *some* theory of New Physics. So... in a frequentist statistical sense, how many hypotheses are we testing? I think it's fair to say nobody has any idea.

Because of this, we run into stupid problems, like the fact that 99% of our 3𝜎 experimental signals are actually just statistical anomalies, not the other way around as naive p-values would suggest. Pop-science articles continue to say "this means there is only a 1 in a thousand chance of a statistical fluke!" and meanwhile smart particle physicists ignore such anomalies until more data comes in.

There's a growing contingent of physicists who want to go full Bayes, and maybe that's the right way to go. But the historical solution is more elegant: we set the threshold of "discovery" to 5𝜎, a threshold so obscenely high that we basically never claim discovery of a signal that goes away later.

Expand full comment

The thing that's confusing you is that p-value hypothesis testing only cares about the risk of declaring the hypothesis true when it isn't. This is similar to a court, where they say "you must be 99% sure of guilt to declare guilt". So it's a weighted outcome.

So, from p-value hypothesis testing's point of view, if you do 100 tests (doesn't matter if they're the same or different -- just as long as the experiments themselves are independent), then that is just 100 opportunities to declare the alt-hypothesis when it isn't true, so your effective p-value quickly becomes "1" unless you rescale your individual p-values as you describe.

When you say that if you start viewing your career of paper-reading within the p-value-testing framework, you quickly find you can't accept anything -- that is exactly the case! Look at it this way, if you read 1 paper with a positive result, and I ask you "do you think you've read a false positive", you'll say "well, hopefully not if they've done everything right then with probability p this is a false positive", but if you read 1000 papers, then you have to say "yes, I'm sure there was a false positive in there".

The issue is you're misusing "p" as an uncertainty or accuracy parameter, when actually "p" is very specifically the probability of reading a false positive. And this is what's confusing you when interpreting what happens when you combine results (the p-value rises when you think it should fall).

If you want to use p-value stuff for this question, you want to combine all information into a single test and set the p-value of that appropriately. But if I were you I'd just find a nice way to graph it and have done with it.

Expand full comment

"By replicating a true result, I've made it into a false one!"

This can happen easily when using hypothesis testing, as a test is allowed to be arbitrarily bad as long as it doesn't reject the null hypothesis too often when the null hypothesis is true. For example, a valid test at the 0.05 level is the following: I use a test at the 0.01 level, but 4% of the time I randomly reject H0 anyway. In addition, different tests might work well in different situations, but you are allowed to pick only one, in advance.

Dividing by the number of comparisons works best if the comparisons are independent; if you are testing the exact same comparison multiple times, this is not true, so it will not be a very good test, in the sense that it will accept H0 when other higher-powered tests would not. But that is inherent in hypothesis testing: rejecting H0 tells you something, accepting it tells you very little without more context (sometimes, if you have a lot of data and you know you used a good test, you can say look, if there really was any meaningful effect I would have been able to reject my null hypothesis by now...).

Expand full comment

Correct me if I'm wrong, but doesn't multiple hypothesis testing have more problems when you start looking at composite outcomes?

For example, if you are looking at an apixaban study where the primary outcome is any major cardiovascular event, death, stroke, or major bleeding, you are testing multiple hypotheses at the same time, and therefore you need to correct the significance level to account for that. Basically the correction is to make sure that if you are testing more than one hypothesis in the same test then you adjust significance to match.

If you are looking at a number of different hypotheses independently you might credibly be accused of p-hacking (looking for any significant results in the data no matter what you originally set out to do), but I don't think that it is standard practice to adjust significance levels based on the number of secondary outcomes.

Expand full comment

Simplest solution: when you get such a big difference between treatment and control group, split your study into two substudies: efficacy of vitamin D in hypertensive and in normotensive patients. If you get statistically significant results in one or both substudies - make a bigger study.

Another thing that would be useful would be putting more emphasis on a large number of small exploratory studies - it's better to do ten n=10 studies and then one n=100 study for the treatment that had the biggest effect size than one n=200 study that leaves us confused about what was the deciding factor.

Expand full comment

I'm pitching a bit outside of my league here, but could you run a PCA on the authoritarian questionnaire? As far as I understand it, you could use a single 'authoritarian' metric and correlate that with ambidexterity, keeping your alpha at 0.05. This of course assumes that 4 questions would produce a meaningful metric and that the questions capture the authoritarian construct, but that's assumed anyways.
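A minimal sketch of the PCA route (assuming a numeric response matrix; names are placeholders):

```python
import numpy as np

def first_component_scores(responses):
    """Score each respondent on the first principal component of the questionnaire."""
    x = responses - responses.mean(axis=0)           # center each question
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[0]                                 # projection onto the leading component

# responses: shape (n_respondents, n_questions); correlate the returned scores with
# ambidexterity as the single test, keeping alpha at 0.05
```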

Expand full comment

I have never understood why in cases like this they test for significant group differences. Even under a frequentist paradigm that does not make any sense. We don't want to be reasonably sure that the groups differ but reasonably sure that they do not differ.

From a frequentist viewpoint the correct way would be to employ equivalence tests, as in https://journals.sagepub.com/doi/full/10.1177/2515245918770963

But if that approach were used, in many cases it would show the groups not to be equivalent (unless you had a really large sample size).

Expand full comment

My last sentence may be misleading, here another try:

But if that approach were used, in many cases it would not give significant evidence for the groups being equivalent (unless you had a really large sample size).

Expand full comment

TLDR: if you don't know what mean shift in a 1-5 rating corresponds to what odds ratio, then it's impossible to integrate the two effects.

The main issue with doing a full Bayesian analysis (or a frequentist analysis focusing on effect size with uncertainty) is that you have two different effect scales: one is an odds ratio and the other is a shift in an ordered predictor (that you treat as metric :P). If they were all on the same scale, then it would be trivial to do a Bayesian update on the full distributions using grid approximation. If you have some theory of how to convert between these scales, then you can transform to that space and then do the Bayesian update via grid approximation.

Also, if your hypothesis is that ambidextrous people are less mainstream, then you should test for a difference in sigma instead of a difference in mu :)

Expand full comment

In a Bayesian analysis you can never just multiply odds factors together from experiments, unless they are completely independent, which they almost never are. So the right calculation for the above would be:

1. 19:1 in favor

2. Odds of 2 being true conditional on 1

3. Odds of 3 being true conditional on 1 and 2

4. Odds of 4 being true conditional on 1 and 2 and 3

Of course, figuring out what the conditional probabilities are is hard and requires some underlying world model that establishes the relationships between those factors.

Expand full comment

The good news is that randomization isn't about balancing the samples relative to baseline characteristics anyway:

https://errorstatistics.com/2020/04/20/s-senn-randomisation-is-not-about-balance-nor-about-homogeneity-but-about-randomness-guest-post/

So this sort of "search for things that are different after randomization and call them confounders" isn't a problem that actually needs addressing.

Well, I guess it would be nice if people stopped looking for differences at baseline in randomized studies. *That's* a problem that needs addressing. And would save a lot of paper and / or electrons.

Expand full comment

I commented above on the part 1, on which (I now notice) several commenters have either similar or more enlightened things to say.

So about point (2). The Bayes factors have already been discussed. Commenter Tolkhoff mentioned hierarchical models. As an exercise and as a student of statistics, here is how I'd begin an analysis, if I were to do state of the art Bayesian inference and try to fit a Bayesian hierarchical (regression) model.

We have four measured (binary and ordinal) outcomes, let us label them A, B, C, D, which are assumed to noisily measure a common underlying factor related to political ideology, Y, which is not directly observable. One wishes to use them to tell how much they (or actually, Y) are related to ambidexterity, X, a binary variable that is known. To simplify the presentation a little bit, I pretend the 1-5 ordinals are also binary outcome variables by categorizing them (there are better ways, but this is an illustration).

In other words, we hypothesize there is some variable associated with ambidexterity-relevant-authoritarianism, Y, which is of some value if person is ambidextrous (X=1) and some other value if not (X=0), with some noise. I write, Y = bX + e, with e ~ N(0, sigma_y). Parameter b is the effect I wish to infer, sigma_y relates to magnitude of variation.

We further hypothesize people are more likely to answer positively to question A if they have a high level of Y. More precisely, say they answer positively to question A with probability p_Y, which somehow depends on Y. Statistically, observing a number of positive events n_A out of n_samples trials follows a binomial distribution. Writing this in terms of distributions, n_A ~ Binomial(p_Y, n_samples). So what about p_Y? The classical way to specify the "it somehow depends on Y" part is to model p_Y = sigmoid(Y). Similarly for B, C, and D. After specifying some reasonable priors for the effect b and sigma_y, one could try to fit this model and then look at the inferred posterior distribution of b (and sigma) to read the Bayesian tea leaves and say something about the effect relative to noise.
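To make the draft concrete, here is roughly how this shared-p_Y version could be written in PyMC (just a sketch under the binary-outcome simplification above; the names and priors are illustrative, not a finished analysis):

```python
import pymc as pm

def build_model(X, answers):
    """X: 0/1 ambidexterity per person; answers: dict mapping question name -> 0/1 array."""
    with pm.Model() as model:
        b = pm.Normal("b", mu=0, sigma=1)            # effect of ambidexterity on latent Y
        sigma_y = pm.HalfNormal("sigma_y", sigma=1)  # person-level noise
        Y = pm.Normal("Y", mu=b * X, sigma=sigma_y, shape=len(X))  # latent Y = bX + e
        p_Y = pm.math.sigmoid(Y)                     # shared answer probability per person
        for name, obs in answers.items():
            pm.Bernoulli(name, p=p_Y, observed=obs)  # one likelihood term per question
    return model

# fit with pm.sample() inside the model context, then inspect the posterior of b
```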

One should notice that in this model, with shared parameter p_Y, I have assumed that all differences between number of positive answers to different questions would be due to random variation only. All the differences between how well the different questions capture the "authoritarianism as it relates to ambidexterity" factor would be eaten by the noise term e. More likely, one would like to fit a model that incorporates a separate effect how each question is related to the purported authoritarianism variable. I don't know if there is any canonical way to go about this, but one simple variation to initial model would be to replace b with separate effects b_A, b_B, ... (and thus separate probabilities p_Y,A, p_Y,B) for each question A, B, ... , where b_A, b_B, ... share common prior about common effect b (maybe writing b_A ~ b + N(0, sigma_ba), and some hyperprior for sigma_bas).

Sounds complicated? The hierarchical logic is less complicated than it sounds, but obtaining a model that fits and interpreting it can require some work. (The model I drafted is not necessarily the best one.) The positive part here is that there is no pressing need to think about P-values or multiple hypothesis corrections. One obtains a posterior distribution for the common effect b and the separate parts b_A, b_B, ..., which provides some level of support for or against each hypothesis, and the fitted model provides a predictive distribution for future data. If one adds more questions (hypotheses) E, F, G, they either provide information about the common effect (affecting the shared beta) or not (affecting the variance estimates). Or so the theory says.

Expand full comment

The difference is post-hoc analysis versus prediction. There are an infinite number of potential factors we could stratify against, making post-hoc statistical hypothesis testing an arbitrary artifact of the number of factors we choose to look at. The best this kind of thing can do is give us an observation to make a hypothesis based off of. We'd need to do another study to test that hypothesis generated by the post-hoc analysis.

In contrast, the researchers who said, "We think it's Vitamin D" had to pick the specific factor they thought would be significant before the experiment took place. The entire difference is whether you're able to predict the specific factor that will be different, versus being able to pick enough factors that one will be different by chance alone.

This is why you often hear "early study produces statistically strong p-value of unexpected finding", only to be followed up by a later study that shows no significant difference. The first study didn't show anything; it generated a hypothesis that hadn't been tested yet. It was the predictive study that actually tested the hypothesis - and found it was statistical noise.

Apply this to your multiple hypothesis test of ambidextrous people. When you test the same hypothesis in different ways, you're doing serial prediction-based tests of the same hypothesis - NOT doing multiple hypothesis testing.

(As you pointed out, each different test comes with its own sub-hypothesis, which may be incorrect. For example, as someone with no hand preference, I thought the Libertarian test was particularly poor for testing cognitive closure. I remember answering that question on the survey and not selecting Libertarian, because of libertarianism's strong ideological opposition to state intervention. For me, all strong ideological positions feel like trying to force-fit the observational square peg into the ideologically round hole.)

So you can ask whether each test actually tested the hypothesis, or whether it tested a new hypothesis (e.g. 'Libertarian is a proxy for cognitive closure' - probably not), but otherwise you shouldn't treat serial testing of the same hypothesis as multiple hypothesis testing. The key is to ask whether or not you're making a prediction before the experiment begins.

https://www.marklwebb.com/2019/05/what-studies-actually-show.html

Expand full comment

On rerandomization, see here:

https://www.aeaweb.org/articles?id=10.1257/aer.20171634

"We show that rerandomization creates a trade-off between subjective performance and robust performance guarantees. However, robust performance guarantees diminish very slowly with the number of rerandomizations. This suggests that moderate levels of rerandomization usefully expand the set of acceptable compromises between subjective performance and robustness."

Expand full comment

The goal of randomization is ONLY to ensure the two groups are as similar as possible at the start of your experiment. There’s nothing statistically fancy about it. We use randomization because it’s easy, and because it should cover potential confounding factors we didn’t think of, as well as the ones we did.

So there would be nothing wrong with re-randomizing until you got any obvious confounding factors pretty even, as long as you do it before starting the experimental procedure.

Or you could pair your participants up on the few variables you expect to be confounding, and randomly assign them to your groups. I.e., take two women between 45 and 50 years old, both with healthy blood pressure and no diabetes, from your sample. Then flip a coin to see which goes into which group. Then take two men between 70 and 75 years old, both with moderately high blood pressure but no diabetes, flip a coin ....
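As a toy sketch of that pairing idea (the covariates, data format, and helper name here are assumptions for illustration, written in Python):

```python
# Pair participants on key covariates, then flip a coin within each matched pair.
import random
from collections import defaultdict

# Hypothetical participant records
participants = [
    {"id": 1, "sex": "F", "age_bin": "45-50", "bp": "normal", "diabetes": False},
    {"id": 2, "sex": "F", "age_bin": "45-50", "bp": "normal", "diabetes": False},
    {"id": 3, "sex": "M", "age_bin": "70-75", "bp": "high", "diabetes": False},
    {"id": 4, "sex": "M", "age_bin": "70-75", "bp": "high", "diabetes": False},
]

def matched_assignment(people, keys=("sex", "age_bin", "bp", "diabetes")):
    """Group people by identical covariates, then randomize within each matched pair."""
    strata = defaultdict(list)
    for person in people:
        strata[tuple(person[k] for k in keys)].append(person)
    treatment, control = [], []
    for members in strata.values():
        random.shuffle(members)                          # the coin flip
        for a, b in zip(members[0::2], members[1::2]):   # take matched pairs
            treatment.append(a)
            control.append(b)
    return treatment, control

treatment, control = matched_assignment(participants)
```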

Expand full comment

About problem II: as you write, you're testing the same hypothesis in a few different ways, not many disparate hypotheses. In this case, it's best to perform a test of joint significance, like an F-test. This is a single test that takes all the effects on the different outcome variables into account together, to figure out whether the difference between groups is significant.
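One hedged way to run such a joint test in practice is a MANOVA, whose test statistics are reported with F approximations; a minimal sketch in Python with statsmodels and made-up data and column names:

```python
# Joint test across several outcome variables with statsmodels' MANOVA.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "ambidextrous": rng.integers(0, 2, n),   # hypothetical group indicator
    "q_authority": rng.normal(size=n),       # hypothetical outcome scores
    "q_libertarian": rng.normal(size=n),
    "q_closure": rng.normal(size=n),
})

# Do the outcomes differ by group when considered jointly?
fit = MANOVA.from_formula("q_authority + q_libertarian + q_closure ~ ambidextrous", data=df)
print(fit.mv_test())
```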

Expand full comment

" Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before." This is called dynamic randomization. Assignment depends on the current subjects strata AND all the strata before. If I recall, it is hard to do a re-randomization test, and you can not always guarantee the large sample properties of your test statistics.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3588596/#:~:text=Here%2C%20'dynamic%20randomization'%20refers,assignments%20of%20previously%20randomized%20patients.

Expand full comment

About re-rolling the randomization in case of a bad assignment of people to groups: in addition to the Stratified Sampling that others have pointed out, I'd like to note that re-rolling is called Rejection Sampling, and it is a useful thing to do.

Suppose you want to sample uniformly from some set S (say, possible assignments of people into equally-sized groups with no significant difference in blood pressure), but you don’t have an easy way to sample from S. If S is a subset of a larger set T (say, possible assignments of people into equally-sized groups), and you have a way to test whether an element of T belongs to S, and you do have a way to uniformly sample from T, then you can turn this into a way to uniformly sample from S, using the re-rolling method you describe: draw an element of T, and if it’s not in S, try again. This is called Rejection Sampling.
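A toy sketch of that idea, assuming Python; the balance criterion, threshold, and data are all made up for illustration:

```python
# Rejection sampling of group assignments: draw uniformly from T (all equal-sized
# splits) and keep re-rolling until the draw lands in S (splits balanced on blood pressure).
import numpy as np

rng = np.random.default_rng(2)
blood_pressure = rng.normal(130, 15, size=100)   # hypothetical participants

def balanced_split(values, max_diff=2.0, max_tries=10_000):
    n = len(values)
    for _ in range(max_tries):
        idx = rng.permutation(n)                 # a uniform draw from T
        a, b = idx[: n // 2], idx[n // 2:]
        if abs(values[a].mean() - values[b].mean()) < max_diff:   # the membership test for S
            return a, b
    raise RuntimeError("no acceptable split found")

group_a, group_b = balanced_split(blood_pressure)
```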

Expand full comment

“But this raises a bigger issue - every randomized trial will have this problem. Or, at least, it will if the investigators are careful and check many confounders. [...] I don't think there's a formal statistical answer for this.”

I don’t think this is quite right. Treatment effect estimation in controlled trials doesn’t necessarily require randomisation, and either way one can balance groups on characteristics observed before the trial. See e.g. Kasy (2016): https://scholar.harvard.edu/files/kasy/files/experimentaldesign.pdf

Expand full comment

Agree, well said, been saying similar for years. Here the 'MHT correction' *helps them*.

But researchers shouldn't give themselves the benefit of the doubt when dealing with diagnostic testing, nor when considering the possibility of a problem with the design that could have a massive impact on the accuracy of the study.

We don't need to 'strongly reject the null' when we smell smoke, before we are worried that there is a fire in our laboratory.

We really need an approach that puts boundaries on the *extent of the possible bias* and then the researcher needs to demonstrate that the 'extreme probability bounds' on such a bias are still fairly small.

I started to discuss this in my general notes [here](https://daaronr.github.io/metrics_discussion/robust-diag.html) ... hope to elaborate

Expand full comment

Your ambidexterity/authoritarianism example sounds like a situation where principal component analysis (PCA) could be helpful. Rather than using four different questions that each reflect a single underlying latent concept, you use PCA to tease out a better single measure of that target latent concept. This is used in economic forecasting, where many different indicators are thought to give information about the overall state of the economy, and also in political science, where the same logic applies.
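A quick sketch of that idea, assuming Python and scikit-learn; the data and names are hypothetical:

```python
# Use PCA to collapse four related survey questions into one latent score.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
latent = rng.normal(size=300)    # unobserved "authoritarianism" driving all four answers
questions = np.column_stack([latent + rng.normal(size=300) for _ in range(4)])

pca = PCA(n_components=1)
score = pca.fit_transform(questions).ravel()     # single measure of the latent concept
print(pca.explained_variance_ratio_)             # share of variation the first component captures
```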

Expand full comment