I have no idea what the actual scientists would say, but the mechanism of "Vitamin D regulates calcium absorption, and calcium is really important for the activation of the immune system, and the immune system fights off viral infections" seems like a decent guess! At least strong enough to make testing worthwhile and, given enormous D-deficiency rates in many of the tested populations, a reasonable candidate for hypothesis testing.

The correct way to do it from a Bayesian perspective is to compare the probability density of the exact outcome you got given h0 vs. h1, and update according to their ratio.

In practice the intuition is that if you were 100 % sure you would get around this result given h1, then a p = 0.05 would indeed make you 20 times as certain in h1. If you expected to get a bit higher or a bit lower, it would make you less than 20 times as certain (or even reduce your certainty). If you expected to get pretty much exactly that result, it could make you thousands of times more certain.

TLDR: you have to look at both p given h0, and p given h1 to do Bayesian analysis

Right. Except that p=0.05 means 5% chance of this outcome or an outcome more extreme, so if we are talking about densities as you say and we were sure we would get this *exact* outcome in h1 world and we got it, that should make us update way more than 20 times.

Yeah, I misread. But I still don't get what you meant by "if you were 100 % sure you would get around this result given h1, then a p = 0.05 would indeed make you 20 times as certain in h1.". Are you just describing a common wrong intuition, or claiming that the intuition makes sense if you were sure your result would be "around" in h1? If so, why does that make sense?

Yeah I wasn't very precise there. What I think I meant is that for some distributions of h1 it would be around 20 times higher. I think it's too late to edit my post, but if I could I would delete that.

Maybe, if in h1 it was around 100% sure that one would get a result equal or more extreme than the actual result, then one would get that 20 factor, right?

This clearly wouldn't be using ALL the information one has. It feels weird though to choose what subset of the information to use and which to ignore when updating depending on the result.

Because the libertarian result is pretty likely if the authoritarianism thesis is false (that's where the p-value comes from), but unlikely if the authoritarianism thesis is true. That's why you can't do any of this with just p-values.

The prior probability of h0 is 0 (no pair of interesting variables in a complex system has zero association), so in a Bayesian analysis no data should update you in favor of (or against) it.

I don't get that--probabilities are the thing that has real world interpretation we care about. In a Bayesian analysis it just makes more sense to think in terms of distributions of effect sizes than null hypothesis testing.

True Bayesian analysis is impossible. It requires enumerating the set of all possible universes which are consistent with the evidence you see, and calculating what percentage have the feature X you're interested in. This is an approximation of Bayesian analysis which updates you towards either h0 (these variables are not strongly linked) or h1 (these variables are strongly linked).

"All possible universes" is just the 4D hypercube representing the possible true differences in group means for the four questions considered. Given a prior on that space it is straightforward to calculate the posterior from the data. The only tricky part is figuring out how to best represent your prior.

The key is to have a large enough study such that if there are 20 potentially relevant factors, even though one of them will probably show a significant difference between groups, that difference will be too small to explain any difference in results. Here the study was tiny, so one group had 25% high blood pressure and the other over 50%.

This is a very good point. However it's not always possible to just "run a bigger study". (In this case, sure, that's probably the right answer). Is it possible to do better with the small sample-size that we have here?

For example, I'm wondering if you can define a constraint on your randomized sampling process that prevents uneven distributions -- "reroll if it looks too biased" in Scott's phrasing, or possibly reweighting the probabilities of drawing each participant based on the population that's been drawn already. Essentially you want the sampled groups to have a distribution in each dimension that matches the sampled population, and that match each other. I'm sure this is something that's common in clinical trial design, and I'll try to dig up some citations for how this is done in the industry.
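A toy sketch of that "reroll if it looks too biased" idea, where the 30/76 high-blood-pressure split, the threshold, and the single-covariate balance check are all invented for illustration:

```python
import random
import statistics

def rerandomize(covariate, max_imbalance=0.1, seed=0):
    """Shuffle subjects into two halves, rerolling until the group means
    on the covariate differ by at most max_imbalance."""
    rng = random.Random(seed)
    idx = list(range(len(covariate)))
    while True:
        rng.shuffle(idx)
        half = len(idx) // 2
        g1, g2 = idx[:half], idx[half:]
        m1 = statistics.mean(covariate[i] for i in g1)
        m2 = statistics.mean(covariate[i] for i in g2)
        if abs(m1 - m2) <= max_imbalance:
            return g1, g2

# 76 subjects, 30 of them with high blood pressure (coded as 1):
high_bp = [1] * 30 + [0] * 46
treat, control = rerandomize(high_bp)
```

One caveat worth flagging: rerolling changes the randomization distribution, which is exactly the concern about selection on unobservable confounders that the econometricians raise further down the thread.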

What you're referring to is called blocking. At the most extreme you pair each person to one other person in such a way that the expected imbalance after randomization is minimized.

Just to clarify, since I'm not totally sure I get the argument, is the idea here that as sample size increases, you are less likely to end up with a large difference in the groups (e.g., in this case, you are less likely to end up with wildly different proportions of people with high blood pressure)?

I think it's worth noting that testing for statistically significant differences in covariates is nonsensical. If you are randomizing appropriately, you know that the null hypothesis is true - the groups differ only because of random chance. If you end up with a potentially important difference between the groups, it doesn't really matter if it's statistically significant or not. Rather, it matters to whatever extent the difference influences the outcome of interest.

In this case, if blood pressure is strongly related to covid recovery, even with a large sample size and a small difference between the groups, it would be a good idea to adjust for blood pressure, simply because you know it matters and you know the groups differ with respect to it.

Note that, while it's true that larger sample sizes will be less likely to produce dramatic differences in covariates between randomized treatment and control groups, larger sample sizes will also increase the chance that even small differences in relevant covariates will be statistically significantly related to differences in the outcome of interest, by virtue of increasing the power to detect even small effects.

On the topic of your point in part I, this seems like a place where stratified sampling would help. Essentially, you run tests on all of your patients, and then perform your random sampling in such a way that the distribution of test results for each of the two subgroups is the same. This becomes somewhat more difficult to do the more different types of tests you run, but it shouldn't be an insurmountable barrier.

It's worth mentioning that I've seen stratified sampling being used in some ML applications for exactly this reason - you want to make sure your training and test sets share the same characteristics.

I was going to propose something like this, unaware it already had a name. I imagine it's harder to stratify perfectly when you have 14 conditions and 76 people, but it would be the way to go.

If one doesn't already exist, someone should write a stratification web tool that doctors or other non-statisticians can just plug in "Here's a spreadsheet of attributes, make two groups as similar as possible". It should just be something you are expected to do, like doing a power calculation before setting up your experiment.

I am a couple of days late, but yes, this is what I would suggest, with the caveat that I don't have much formal training in experimental design. I believe that in addition to "stratified sampling", another search keyword one wants is "blocking".

Intuitively, when considering causality, it makes sense. One hypothesizes that two things, confounding factor A and intervention B, could both have an effect, and one is interested in the effect of intervention B. To be on the safe side, one wants factor A balanced across the groups rather than left to chance.

The intuitive reason here is that randomization is not magic, but serves a purpose: reducing statistical bias from unknown confounding factors.

Abstracted lesson I have learned from several practical mistakes over the years: Suppose I have 1000 data items, say, math exam scores of students from all schools in a city. As far as I am concerned, I do not know of any structure in the data, so I call it random and split the samples into two groups of 500: the first 500 in one and the rest in the other. What if I am later told that the person who collected the data recorded them in the spatial order of the school districts as they parsed the records, forgot to include the district information, but now tells me the first 500 students include pupils from the affluent district whose school has a nationally renowned elite math program? An unknown factor has now become a known factor, while the data and sample order remained the same! Doing the randomization myself is how I could have avoided this kind of bias. And if the person tells me the students' districts beforehand, I can actually take this into account while designing my statistical model.
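The fix described in that anecdote is a one-liner; a sketch with stand-in data:

```python
import random

# 1000 records that arrived in an unknown (possibly district-sorted) order;
# the list of ints is a stand-in for the real exam-score records.
records = list(range(1000))

rng = random.Random(42)      # arbitrary seed, for reproducibility
rng.shuffle(records)         # destroy whatever order the collector imposed
group_a, group_b = records[:500], records[500:]
```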

Were those really the only questions you considered for the ambidexterity thing? I had assumed, based on the publication date, that it was cherry-picked to hell and back.

There's good reason to be suspicious of the result (as I'm sure Scott would agree - one blogger doing one test with one attempted multiple-hypothesis test correction just isn't that strong evidence, especially when it points the opposite direction of the paper he was trying to replicate).

I think there's no reason to be suspicious that Scott was actually doing an April Fools - it just wasn't a good one, and Scott tends to do good twist endings when he has a twist in mind.

One comment here: if you have a hundred different (uncorrelated) versions of the hypothesis, it would be *hella weird* if they all came back around p=.05. Just by random chance, if you'd expect any individual one of them to come back at p=0.05, then if you run 100 of them you'd expect one of them to be unusually lucky and come back at p=.0005 (and another one to be unusually unlucky and end up at p=0.5 or something). Actually getting 100 independent results at p=.05 is too unlikely to be plausible.

Of course, IRL you don't expect these to be independent - you expect both the underlying thing you're trying to predict and the error sources to be correlated across them. This is where it gets messy - you sort of have to guess how correlated these are across your different metrics (e.g. high blood pressure would probably be highly correlated with cholesterol or something, but only weakly correlated with owning golden retrievers). And that in itself is kind of a judgement call, which introduces another error source.

What I generally use and see used for multiple testing correction is the Benjamini-Hochberg method - my understanding is that essentially it looks at all the p-values and for each threshold X it compares to how many p-values<X you got vs how many you'd expect by chance, and adjusts them up depending on how "by chance" they look based on that. In particular, in your "99 0.04 results and one 0.06" example it only adjusts the 0.04s to 0.040404.
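Here is a small sketch of the Benjamini-Hochberg adjustment as described above (written from the textbook definition, not any particular library), checked against the "99 results at 0.04 and one at 0.06" example:

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values for a list of raw p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # ascending p-values
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest rank down,
        i = order[rank - 1]               # enforcing monotonicity as we go
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adj = benjamini_hochberg([0.04] * 99 + [0.06])
# The 99 values of 0.04 adjust only to 0.04 * 100/99 ≈ 0.0404,
# and the lone 0.06 is left at 0.06.
```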

But generally speaking you're right, all those methods are designed for independent tests, and if you're testing the same thing on the same data, you have no good way of estimating how dependent one result is on another so you're sort of screwed unless you're all right with over-correcting massively. (Really what you're supposed to do is pick the "best" method by applying however many methods you want to one dataset and picking your favorite result, and then replicate that using just that method on a new dataset. Or split your data in half to start with. But with N<100 it's hard to do that kind of thing.)

For a Bayesian analysis, it's hard to treat "has an effect" as a hypothesis and you probably don't want to. You need to treat each possible effect size as a hypothesis, and have a probability density function rather than a probability for both prior and posterior.

You *could* go the other way if you want to mess around with Dirac delta functions, but you don't want that. Besides, the probability that any organic chemical will have literally zero effect on any biological process is essentially zero to begin with.

And the effect size is what you actually want to know, if you are trying to balance the benefits of a drug against side effects or a of social programme against costs.

Has there already been any formal work done on generating probability density functions for statistical inferences, instead of just picking an overly specific estimate? This seems like you're on to something big.

Meteorology and economics, for example, return results like "The wind Friday at 4pm will be 9 knots out of the northeast" and "the CPI for October will be 1.8% annualized", which inevitably end up being wrong almost all of the time. If I'm planning a party outside Friday at 4pm I'd much rather see a graph showing the expected wind magnitude range than a single point estimate.

If the observations all follow a certain form, and the prior follows the conjugate form, the posterior will also follow the conjugate form, just with different parameters. Most conjugate forms can include an extremely uninformative version, such as a Gaussian with near-infinite variance.

The Gaussian's conjugate prior (for its mean, with known variance) is itself, which is very convenient.

The conjugate prior for Bernoulli data (a series of yes-or-nos) is called a Beta distribution (density proportional to x^(a-1)(1-x)^(b-1)) and is probably what we want here. Its higher-dimensional generalization is called a Dirichlet distribution.
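A tiny illustration of that conjugate update for yes/no data (the prior and the counts are invented):

```python
def beta_update(a, b, successes, failures):
    """Beta(a, b) prior + Bernoulli observations -> Beta(a+s, b+f) posterior."""
    return a + successes, b + failures

# Uniform prior Beta(1, 1), then observe 7 yeses and 3 nos:
a, b = beta_update(1, 1, 7, 3)
posterior_mean = a / (a + b)    # 8/12, pulled slightly toward the prior's 1/2
```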

This is true, Scott's "Bayes factors" are completely the wrong ones. It's also true that the "real" Bayes factors are hard to calculate and depend on your prior of what the effect is, but just guessing a single possible effect size and calculating them will give you a pretty good idea — usually much better than dumb old hypothesis testing.

For example, let's assume that the effect size is 2 SD. This way any result 1 SD and up will give a Bayes factor in favor of the effect, anything 1 SD or below (or negative) will be in favor of H0. Translating the p values to SD results we get:

1. p = 0.049 ~ 2 SD

2. p = 0.008 ~ 2.65

3. p = 0.48 ~ 0

4. p = 0.052 ~ 2

For each one, we compare the height of the density function at the result's distance from 0 and from 2 (our guessed-at effect), for example using NORMDIST(result - hypothesis, 0, 1, 0) in Excel.

1. p = 0.049 ~ 2 SD = .4 density for H1 and .05 density for H0 = 8:1 Bayes factor in favor.

2. p = 0.008 ~ 2.65 = .32 (density at .65) and .012 (density at 2.65) = 27:1 Bayes factor.

3. p = 0.48 ~ 0 = 1:8 Bayes factor against

4. p = 0.052 ~ 2 = 8:1 Bayes factor in favor

So overall you get 8*8*27/8 = 216 Bayes factor in favor of ambidexterity being associated with authoritarian answers. This is skewed a bit high by a choice of effect size that hit exactly on the results of two of the questions, but it wouldn't be much different if we chose 1.5 or 2.5 or whatever. If we used some prior distribution of possible positive effect sizes from 0.1 to infinity, maybe the factor would be 50:1 instead of 216:1, but still pretty convincing.
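Redoing that arithmetic in code, with the same assumed 2 SD effect (the exact densities give about 200:1 rather than the hand-rounded 216:1):

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of the normal distribution, i.e. NORMDIST(x, mean, sd, 0)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

results_sd = [2.0, 2.65, 0.0, 2.0]   # the four results, in SD units
effect = 2.0                          # assumed effect size under h1

total_bf = 1.0
for z in results_sd:
    total_bf *= normal_pdf(z, mean=effect) / normal_pdf(z, mean=0.0)
# total_bf comes out around 200 in favor of h1 (exp(5.3), to be exact),
# close to the rounded 8 * 8 * 27 / 8 = 216 above.
```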

I don't understand what you are doing here. Aren't you still double-counting evidence by treating separate experiments as independent? After you condition on the first test result, the odds ratio for the other ones has to go down because you are now presumably more confident that the 2SD hypothesis is true.

You're making a mistake here. p = 0.049 is nearly on the nose if the true effect size is 2 SD *and you have one sample*. But imagine you have a hundred samples, then p = 0.049 is actually strong evidence in favor of the null hypothesis. In general the standard deviation of an average falls like the square root of the sample size. So p = 0.049 is like 2/sqrt(100) = .2 standard deviations away from the mean, not 2.
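That scaling is easy to check with the standard library (assuming, as the comment does, a two-sided p-value):

```python
import math
from statistics import NormalDist

# z-score of the *sample mean* for a two-sided p of 0.049: about 1.97.
z = NormalDist().inv_cdf(1 - 0.049 / 2)

# In individual-level standard deviations, the implied effect shrinks with
# sample size as z / sqrt(n): ~2 SD at n = 1, but only ~0.2 SD at n = 100.
implied_effect = {n: z / math.sqrt(n) for n in (1, 100)}
```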

Not directly answering the question at hand, but there's a good literature that examines how to design/build experiments. One paper by Banerjee et al. (yes, the recent Nobel winner) finds that you can rerandomize multiple times for balance considerations without much harm to the performance of your results: https://www.aeaweb.org/articles?id=10.1257/aer.20171634

So this experiment likely _could_ have been designed in a way that doesn't run into these balance problems

So this is fine in studies on healthy people where you can give all of them the drug on the same day. If you’re doing a trial on a disease that’s hard to diagnose then the recruitment is staggered and rerandomizing is impossible.

In the paper I linked above, Appendix A directly speaks to your concern on staggered recruitment. Long story short, sequential rerandomization is still close to first-best and should still be credible if you do it in a sensible manner.

On the ambidextrous analysis, I think what you want for a Bayesian approach is to say, before you look at the results of each of the 4 variables, how much you think a success or failure in one of them updates your prior on the others. e.g. I could imagine a world where you thought you were testing almost exactly the same thing and the outlier was a big problem (and the 3 that showed a good result really only counted as 1.epsilon good results), and I could also imagine a world in which you thought they were actually pretty different and there was therefore more meaning to getting three good results and less bad meaning to one non-result. Of course the way I've defined it, there's a clear incentive to say they are very different (it can only improve the likelihood of getting a significant outcome across all four), so you'd have to do something about that. But I think the key point is you should ideally have said something about how much meaning to read across the variables before you looked at all of them.

"I don't think there's a formal statistical answer for this."

Matched pairs? Ordinarily only gender and age, but in principle you can do matched pairs on arbitrarily many characteristics. You will at some point have a hard time making matches, but if you can match age and gender in, say, 20 significant buckets, and then you have, say, five binary health characteristics you think might be significant, you would have about 640 groups. You'd probably need a study of thousands to feel sure you could at least approximately pair everyone off.

Hmm, wondering if you can do a retrospective randomized matched pairs subset based on random selections from the data (or have a computerized process to do a large number of such sub-sample matchings). Retrospectively construct matched pair groups on age, gender and blood pressure. Randomize the retrospective groupings to the extent possible. Re-run the analysis. Redo many times to explore the possible random subsets.

Isn't the point to put each member of a matched pair into the control or the intervention group? If so you'd need to do it first (i.e. in Scott's example before administering the vitamin D).

Not necessarily so. You could I think take an experiment that hadn't been constructed as a matched pair trial and then choose pairs that match on your important variables, one from treatment, one from control, as a sub-sample. If for any given pair you choose there are multiple candidates in the treatment and control group to select, you can "randomize" by deciding which two candidates constitute a pair. Now you've got a subsample of your original sample consisting of random matched pairs, each of which has one treatment and one control.

Post hoc you have a set of records, each of which has values for control (yes/no), age, blood pressure, etc., and outcome. You can then calculate a matching score on whichever parameters you like for each pair of intervention-control subjects. Then for each intervention subject you can rank the control subjects by descending match score and select the top 1 (or more if desired). This is matching with replacement, but could also be done without replacement.

There are many ways to achieve this after the fact is what I am saying. None are perfect but most are reasonable.
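One way to sketch that ranking-and-matching step: greedy 1:1 nearest-neighbour matching without replacement on (age, systolic blood pressure). The records and the plain Euclidean distance are invented for illustration; real analyses usually scale the covariates or match on a propensity score instead.

```python
import math

treated  = [(60, 140), (45, 120), (70, 160)]            # (age, systolic BP)
controls = [(62, 138), (44, 118), (50, 125), (71, 158)]

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

available = list(range(len(controls)))
pairs = []
for i, t in enumerate(treated):
    # best remaining control for this treated subject
    j = min(available, key=lambda c: distance(t, controls[c]))
    pairs.append((i, j))
    available.remove(j)            # without replacement
# pairs == [(0, 0), (1, 1), (2, 3)]
```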

I'm not convinced there is actually an issue. Whenever we get a positive result in any scientific experiment there is always *some* chance that the result we get will be random chance rather than because of a real effect. All of this debate seems to be about analyzing a piece of that randomness and declaring it to be a unique problem.

If we do our randomization properly, on average, some number of experiments will produce false results, but we knew this already. It is not a new problem. That is why we need to be careful to never put too much weight in a single study. The possibility of these sorts of discrepancies is a piece of that issue, not a new issue. The epistemological safeguards we already have in place handle it without any extra procedure to specifically try to counter it.

Matched pair studies seem fine, and I see the benefit of them; I am just not convinced they are always necessary, and they can't solve this problem.

Ultimately there are an arbitrary number of possible confounding variables, and no matter how much matching you do, there will always be some that "invalidate" your study. You don't even know which are truly relevant. If you did, humanity would be done doing science.

If you were able to do a matched pair study that matched everything, not just age and gender, you would have to be comparing truly identical people. At that point, your study would be incredibly powerful, something fundamentally better than an RCT, but obviously this is impossible.

In any given study one starts with the supposition that some things have a chance of being relevant. For example, you may think some treatment, such as administering vitamin D, has a chance of preventing an illness or some of its symptoms. There are other things that are outside what you can control or affect that you also think may well be relevant, though you hope not too much. And finally there are an unlimited number of things that you think are very unlikely or that you have no reason to believe would be relevant but that you cannot rule out.

You seem to be suggesting everything in the second category should be treated as though it were in the third category, or else everything in the third category ought to be treated as though it belonged in the second category. But the world is not like that. It is not the case that the existence of things in the third category means that there are always more things that should have been in the second category. The two categories are not the same.

Although the third category is larger than the second category, the second category is also practically unlimited. Also, the exact location of the border between the two is subjective.

It seems like it would be impossible to explicitly correct for every possible element of the second category, but if you don't, it isn't clear that you are accomplishing very much.

It is not practically unlimited. At any given time there will typically be a small finite number of confounders of serious importance. Blood pressure is an obvious one. You are making a slippery slope argument against dealing with confounders, but there's no slope, much less a slippery one.

It seems to me to be a huge number. I am considering:

- Preexisting medical conditions

- Age, gender, race/ethnicity

- Other drugs/medical care (both for COVID and for preexisting conditions)

- Environmental factors

- Every possible combination of the previous elements on the list

Which do you think aren't actually relevant?

I am not making a slippery slope argument. I'm not sure if you are misinterpreting my comment, or misusing the name of the argument, but either way, you are incorrect. If you clarify what you meant, I will explain in more detail.

The idea behind matched pair studies is that the pairing evens out all the known potentially confounding variables, while the randomization within the pair (coin toss between the two as to who goes into experimental group and who into control) should take care of the unknown ones.

Stratification is also a valuable tool, but I am not convinced it is always necessary either, and using it inherently introduces p-hacking concerns and weakens the available evidence.

Both stratification and matching exist to deal with problems like this. Maybe a lot of times it's not necessary, but this post is entirely about a case where it might be: because of the high blood pressure confounder. I'm not sure what to make of the idea that the solution is worse than the problem: why do you think that?

I do not believe that the solution is worse than the problem, at least not in such an absolute sense.

What I actually believe is that the solution is not generally necessary. I also believe that in most situations where the experiment has already been done, the fact that one of these solutions had not been applied shouldn't have a significant impact on our credence about the study.

When you say "some of them", do you mean some of the studies, or some of the variables? I think you meant some of the studies, so I am going to respond to that. Please correct me if I am wrong.

I think that this is happening in (almost?) all of the studies. It is just a question of whether we happen to notice the particular set of variables that it is happening for. I think that the section of TFA about golden retrievers is a little bit misleading. Even considering only variables that could conceivably be relevant, there are still a nearly infinite number of possible variables. Whether it gets noticed for a particular study is arbitrary, and more related to which variables happened to get checked than to the strength of the study itself.

I would agree; when we say "there is a 5% chance of getting this result by random chance", then this is exactly the sort of scenario which is included in that five percent.

But what is currently doing my head in is this: once we know that there _is_ a significant difference between test and control groups in a potentially-significant variable, are we obliged to adjust our estimate that the observed effect might be due to random chance?

And if we are, then are we obliged to go out fishing for every other possible difference between our test and control groups?

I agree with you. I think there are decent arguments on both sides.

Arguments in favor of doing these corrections:

- Once we identify studies where there is a difference in a significant variable, that means that this particular study is more likely to be the result of chance

- Correcting can only improve our accuracy because we are removing questions where the randomization is causing interference.

Arguments against doing the corrections:

- There is always some significant variable that is randomized poorly (because of how many there are); when we notice it, that tells us more about what we can notice than it does about the study itself.

- A bunch of these sorts of errors cancel each other out. Removing some of them is liable to have unforeseen consequences.

- Ignoring the issue altogether doesn't do any worse on average. Trying to correct in only specific instances could cause problems if you aren't precise about it.

It seems complicated enough that I don't fully know which way is correct, but I am leaning against making the corrections.

If your original confidence level included the possibility that the study might have had some bias that you haven't checked for, then in principle, when you subsequently go fishing for biases, every bias you find should give you less faith in the study, BUT every bias that you DON'T find should give you MORE faith. You are either eliminating the possible worlds where that bias happened or eliminating the possible worlds where it didn't happen.

Of course, counting up all the biases that your scrutiny could've found (but didn't) is hard.

I don't think it quite works like that. The possible biases are equally likely to help you or hurt you. Before you actually examine them, their expected impact is 0 (assuming you are running a proper RCT). Every bias you find that helped the result resolve as it did would give you less credence in that result. Every bias in the opposite direction would give you more. The biases you don't know about should average out.

You basically never want to base your analysis on combining p-values directly. You want to--as you said--combine the underlying data sets and compute a p-value on that. Or, alternately, treat each as a meta-data-point with some stdev of uncertainty and then find the (lower) stdev of their combined evidence. (Assuming they're all fully non-independent in what they're trying to test.)

A good way to handle the problem of failed randomizations is with re-weighting based on propensity scores (or other similar methods, but PSs are the most common). In brief, you use your confounders to predict the probability of having received the treatment, and re-weight the sample depending on the predicted probabilities of treatment. The end result of a properly re-balanced sample is that, whatever the confounding effect of blood pressure on COVID-19 outcomes, it confounds both treated and untreated groups with equal strength (in the same direction). Usually you see this method talked about in terms of large observational data sets, but it's equally applicable to anything (with the appropriate statistical and inferential caveats). Perfectly balanced data sets, like from a randomized complete block design, have constant propensity scores by construction, which is just another way of saying they're perfectly balanced across all measured confounders.
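A toy version of that re-weighting with a single binary confounder (all the numbers are made up; a real analysis would estimate the propensity score with, e.g., a logistic regression over many confounders):

```python
# 8 subjects: high blood pressure yes/no, treatment yes/no, some outcome.
high_bp = [1, 1, 1, 1, 0, 0, 0, 0]
treated = [1, 1, 1, 0, 1, 0, 0, 0]
outcome = [10.0, 10.0, 10.0, 14.0, 6.0, 8.0, 8.0, 8.0]
n = len(high_bp)

# Propensity score: P(treated | high_bp), estimated within each stratum.
def propensity(bp):
    stratum = [treated[i] for i in range(n) if high_bp[i] == bp]
    return sum(stratum) / len(stratum)

e = [propensity(high_bp[i]) for i in range(n)]

# Inverse-propensity weights: treated get 1/e, controls get 1/(1 - e).
w = [1 / e[i] if treated[i] else 1 / (1 - e[i]) for i in range(n)]

def weighted_mean(group):
    idx = [i for i in range(n) if treated[i] == group]
    return sum(w[i] * outcome[i] for i in idx) / sum(w[i] for i in idx)

# After re-weighting, both groups look like the full sample on blood
# pressure, so the difference in weighted means estimates the effect.
ate = weighted_mean(1) - weighted_mean(0)
```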

For the p-value problem, whoever comes in and talks about doing it Bayesian I believe is correct. I like to think of significance testing as a sensitivity/specificity/positive predictive value problem. A patient comes in from a population with a certain prevalence of a disease (aka a prior), you apply a test with certain error statistics (sens/spec), and use Bayes' rule to compute the positive predictive value (assuming it comes back positive, NPV otherwise). If you were to do another test, you would use the old PPV in place of the original prevalence, and do Bayes again. Without updating your prior, doing a bunch of p-value based inferences is the same as applying a diagnostic test a bunch of different times without updating your believed probability that the person has the disease. This is clearly nonsense, and for me at least it helps to illustrate the error in the multiple hypothesis test setting.
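The diagnostic-test analogy in code (the prevalence, sensitivity, and specificity are invented round numbers):

```python
def ppv(prior, sensitivity, specificity):
    """P(disease | positive test), by Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

p0 = 0.01                      # 1% prevalence: the prior
p1 = ppv(p0, 0.90, 0.95)       # one positive test: ~0.15
p2 = ppv(p1, 0.90, 0.95)       # second positive, using p1 as the new prior: ~0.77
```

Skipping the update (feeding p0 into both calls) is the error the comment describes: each "test" would look equally surprising no matter how much evidence has already accumulated.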

Finally, seeing my name in my favorite blog has made my day. Thank you, Dr. Alexander.

> Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it?

The point of "randomizing" is to drown out factors that we don't know about. But given that we know that blood pressure is important, it's insane to throw away that information and not use it to divide the participant set.

I think the proper way to do this might be stratified sampling [1]. Divide the population into all relevant subgroups that you know about and then sample from each subgroup at the same rate to fill your two groups.

A very simple test I'd like to see would be to re-run the analysis with all high-blood-pressure patients removed. (Maybe that's what they did when 'controlling for blood pressure' - or maybe they used some statistical methodology. The simple test would be hard to get wrong.)

If you did that you'd have only eleven patients left in the control group, which I'm gonna wildly guess would leave you with something statistically insignificant.

Well, that's the trouble with small tests. If the blood pressure confound is basically inseparable, and blood pressure is a likely Covid signifier, the result is just not all that strong for Vitamin D. One's priors won't get much of a kick.

>I think the proper way to do this might be stratified sampling [1]. Divide the population into all relevant subgroups that you know about and then sample from each subgroup at the same rate to fill your two groups.

I don't think that works for an n=76 study with fifteen identified confounders, or at least the simple version doesn't. You can't just select a bunch of patients from the "high blood pressure" group and then from the "low blood pressure" group, and then from the "over 60" group and the "under 60" group, and then the "male" group and the "female" group, etc., etc., because each test subject is a member of three of those groups simultaneously. By the time you get to e.g. the "men over 60 with low blood pressure" group, with n=76 each of your subgroups has an average of 9.5 members. With fifteen confounders to look at, you've got 32,768 subgroups, each of which has an average of 0.002 members.

If you can't afford to recruit at least 65,000 test subjects, you're going to need something more sophisticated than that.

This topic is actually well discussed among randomista econometricians. I believe they used to advise "rerolling" until you get all confounders to be balanced, but later thought it might create correlations or selection on *unobservable* confounders, so weakly advised against it.

I agree that stratification of some sort is what I would try.

For something more sophisticated, see these slides[1] by Chernozhukov which suggest something called post-double-selection.

The post-double-selection method is to select all covariates that predict either treatment assignment or the outcome by some measure of prediction (t-test, Lasso, ...), then include those covariates in the final regression and use the standard confidence intervals.
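For the flavor of it, here is a rough sketch of post-double-selection on simulated data. The Lasso penalties, the data-generating process, and the use of a linear model for a binary treatment are all assumptions for illustration, not the procedure from the slides:

```python
# Post-double-selection sketch (Belloni-Chernozhukov-Hansen flavor):
# 1) Lasso of treatment on covariates, 2) Lasso of outcome on covariates,
# 3) OLS of outcome on treatment plus the UNION of selected covariates.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k = 500, 20
X = rng.normal(size=(n, k))
D = (X[:, 0] + rng.normal(size=n) > 0).astype(float)     # treatment depends on X0
Y = 1.0 * D + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

sel_d = np.abs(Lasso(alpha=0.05).fit(X, D).coef_) > 1e-8  # predicts treatment
sel_y = np.abs(Lasso(alpha=0.05).fit(X, Y).coef_) > 1e-8  # predicts outcome
keep = sel_d | sel_y                                      # union of both sets

Z = np.column_stack([np.ones(n), D, X[:, keep]])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
effect = beta[1]            # estimated treatment effect; true value is 1.0
```

The point of selecting on both equations is that a confounder strongly related to treatment but weakly related to outcome (or vice versa) still gets controlled for.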

Regarding Bayes: if a test (that is independent of the other tests) gives a Bayes factor of 1:1, that means the test tells you nothing. Like if you tested the vitamin D thing by tossing a coin. It's no surprise that it doesn't change anything.

If a test for vitamin D and covid has lots of participants and the treatment group doesn't do better, that's not a 1:1 Bayes factor for any hypothesis where vitamin D helps significantly. The result is way less likely to happen in the world where vitamin D helps a lot.

The horizontal axis shows the results of a test. In black, the probability density in the null world (vitamin D doesn't help with covid). In blue, the probability density in some specific world (vitamin D helps with covid in *this* specific way).

The test result comes out as marked in red. The area to the right of that point is the p. It only depends on the black curve, so we don't need to get too specific about our hypothesis other than to know that the higher values are more likely in the hypothetical worlds.

The Bayes factor would be the relative heights of the curves at the point of the result. Those depend on the specific hypotheses, and can clearly take values greater or smaller than 1 for "positive test results".
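The setup being described can be sketched numerically, taking both densities to be normal (the location of the alternative is an arbitrary assumption here):

```python
# Null (black) and one specific alternative (blue), both normal by assumption.
from scipy.stats import norm

h0 = norm(loc=0, scale=1)   # "vitamin D doesn't help"
h1 = norm(loc=3, scale=1)   # "vitamin D helps in *this* specific way"
result = 1.96               # the observed statistic (the red mark)

p_value = h0.sf(result)                          # tail area under the null only
bayes_factor = h1.pdf(result) / h0.pdf(result)   # ratio of the curve heights
```

Here p comes out around 0.025 while the Bayes factor is only around 4:1, one concrete way the two quantities can come apart.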

1. Did someone try to aggregate the survival expectation for both groups (patient by patient, then summed up) and control for this?

Because this is the one and main parameter.

2. Is the "previous blood pressure" strong enough a detail to explain the whole result?

3. My intuition is that this multiple comparison thing is way too dangerous an issue to go ex post and use one of the test to explain the result.

This sounds counter intuitive. But this is exactly the garden of forked path issue. Once you go after the fact to select numbers, your numbers are really meaningless.

Unless of course you happen to land on the 100% smoker example.

But!

You will need a really obvious situation, rather than a maybe parameter.

Rerolling the randomization as suggested in the post, doesn't usually work because people are recruited one-by-one on a rolling basis.

But for confounders that are known a priori, one can use stratified randomization schemes, e.g. block randomization within each stratum (preferably categories, and preferably only few). There are also more fancy "dynamic" randomization schemes that minimize heterogeneity during the randomization process, but these are generally discouraged (e.g., EMA guideline on baseline covariates, Section 4.2).

In my naive understanding, spurious effects due to group imbalance are part of the game, that is, included in the alpha = 5% of false positive findings that one will obtain in the null hypothesis testing model (for practical purposes, it's actually only 2.5% because of two-sided testing).

But one can always run sensitivity analyses with a different set of covariates, and the authors seem to have done this anyway.

I think I've read papers where the population was grouped into similar pairs and each pair was randomized. It seems to me that the important question is not so much rolling recruitment as speed, in particular the time from recruitment and preliminary measurement to randomization. Acute treatments have no time to pair people, but some trials have weeks from induction to treatment.

This is a nice trick with limited applicability in most clinical trial settings, because it requires that you know all of your subjects' relevant baseline characteristics simultaneously prior to randomizing in order to match them up. They could do that in their HIV therapy trial example because they would get "batches" of trial eligible subjects with preexisting HIV infection. In the COVID-Vit D study, and most others, subjects come one at a time, and expect treatment of some kind right away.

Nope, you could take the first person who arrives, say female between age 60 and 70, high blood pressure but no diabetes, and flip a coin to see which group she goes into. Treat her, measure what happens. Continue doing this for each new participant until you get a ‘repeat’; another female between 60 and 70, high blood pressure but no diabetes. She goes into whichever group that first woman didn’t go in.

Keep doing this until you’ve got a decent sized sample made up of these pairs. Discard the data from anyone who didn’t get paired up.
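A sketch of that rolling pairing scheme, with the strata reduced to a single made-up label per patient:

```python
# Rolling matched pairs: the first arrival in each stratum gets a random arm;
# the next arrival in the same stratum is forced into the opposite arm.
# Anyone left unmatched at the end is discarded. Data is invented.
import random

def allocate(patients, stratum_of, rng=random.Random(1)):
    waiting = {}                 # stratum -> (patient, arm) of the unmatched member
    treat, control = [], []
    for p in patients:
        s = stratum_of(p)
        if s in waiting:
            first, arm = waiting.pop(s)
            if arm == "treat":
                treat.append(first); control.append(p)
            else:
                control.append(first); treat.append(p)
        else:
            waiting[s] = (p, "treat" if rng.random() < 0.5 else "control")
    discarded = [p for p, _ in waiting.values()]
    return treat, control, discarded

patients = [(i, i % 5) for i in range(23)]       # (id, stratum label)
t, c, discarded = allocate(patients, lambda p: p[1])
```

The `discarded` list is the cost the next comment complains about: every stratum with an odd count loses its last arrival.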

What you've described is something different than what the paper talks about though. Your solution is basically a dynamic allocation with equal weighting on HTN and diabetes as factors, and the complete randomization probability or second best probability set to 0% (Medidata Rave and probably other databases can do this pretty easily). And while it would definitely eliminate the chances of a between group imbalance on hypertension or diabetes, I still don't see it being a popular solution for two reasons. First, because the investigators know which group a subject is going to be in if they are second in the pair; second, because it's not clear ahead of time that you don't just want the larger sample size that you'd get if you weren't throwing out subjects that couldn't get matched up. It's sort of a catch-22: small trials like the Vitamin D study need all the subjects they can get, and can't afford to toss subjects for the sake of balance that probably evens out through randomization anyway; large trials can afford to do this, but don't need to, because things *will* even out by the CLT after a few thousand enrollments.

I think controlling for noise issues with regression is a fine solution for part 1. You could also use ways of generating random groups subject to a constraint like "each group should have similar average Vitamin D." Pair up experimental units with similar observables, and randomly assign one of each pair to each group (like https://en.wikipedia.org/wiki/Propensity_score_matching but with an experimental intervention afterwards).

For question 2, isn't this what https://en.wikipedia.org/wiki/Meta-analysis is for? Given 4 confidence intervals of varying widths and locations, you either: 1. determine the measurements are likely to be capturing different effects, and can't really be combined; or 2. generate a narrower confidence interval that summarizes all the data. I think something like random-effects meta-analysis answers the question you are asking.

Secondarily, 0 effect vs some effect is not a good Bayesian hypothesis. You should treat the effect as having some distribution, which is changed by each piece of information. The location and shape can be changed by any test result; an experiment with effect near 0 moves the probability mass towards 0, while an extreme result moves it away from 0.

You shouldn't just mindlessly adjust for multiple comparisons by dividing the significance threshold by the number of tests. This Bonferroni adjustment is used to "control the familywise error rate" (FWER), which is the probability of rejecting at least one hypothesis given that they are all true null hypotheses. Are you sure that is what you want to control for in your ambidexterity analysis? It's not obvious that it is.

By the way: Thanks for mentioning the "digit ratio" among other scientifically equally relevant predictors such as amount of ice hockey played, number of nose hairs, eye color, percent who own Golden Retrievers.

The easy explanation here is that the number of people randomized was so small that there was no hope of getting a meaningful difference. Remember, the likelihood of adverse outcome of COVID is well below 10% - so we're talking about 2-3 people in one group vs 4-5 in the other. In designing a trial of this sort, it's necessary to power it based on the number of expected events rather than the total number of participants.

Hmm yes, a few of the responses have suggested things like stratified randomisation and matched pairs but my immediate intuition is that n = ~75 is too small to do that with so many confounders anyway.

I think you would want to construct a latent construct out of your questions that measures 'authoritarianism', and then conduct a single test on that latent measure. Perhaps using a factor analysis or similar to try to divine the linear latent factors that exist, and perusing them manually (without looking at correlation to your response variable, just internal correlation) to see which one seems most authoritarianish. And then finally measuring the relationship of that latent construct to your response variable, in this case, ambidexterity.
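A sketch of that pipeline on simulated survey data; the item structure, group sizes, and variable names are all invented for illustration:

```python
# Build a latent "authoritarianism" score by factor analysis on the survey
# items, then run ONE test of that score against the response variable.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 400
latent = rng.normal(size=n)                       # true underlying trait
items = np.column_stack([latent + rng.normal(scale=0.8, size=n)
                         for _ in range(4)])      # four noisy survey questions

fa = FactorAnalysis(n_components=1).fit(items)
score = fa.transform(items).ravel()               # one latent score per person

ambidextrous = rng.random(n) < 0.1                # hypothetical grouping variable
t, p = ttest_ind(score[ambidextrous], score[~ambidextrous])
```

Because there is a single p-value at the end, no multiple-comparison correction is needed; the price is committing in advance to the latent construct.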

This Chinese study has me flummoxed - are they saying "vitamin D has no effect on blood pressure UNLESS you are deficient, over 50, obese and have high blood pressure"?

"Oral vitamin D3 has no significant effect on blood pressure in people with vitamin D deficiency. It reduces systolic blood pressure in people with vitamin D deficiency that was older than 50 years old or obese. It reduces systolic blood pressure and diastolic pressure in people with both vitamin D deficiency and hypertension."

Maybe the Cordoba study was actually backing up the Chinese study, in that if you're older, fatter, have high blood pressure and are vitamin D deficient then taking vitamin D will help reduce your blood pressure. And reducing your blood pressure helps your chances with Covid-19.

So it's not "vitamin D against covid", it's "vitamin D against high blood pressure in certain segments of the population against covid" which I think is enough to confuse the nation.

You want to use 4 different questions from your survey to test a single hypothesis. I *think* the classical frequentist approach here would be to use Fisher's method, which tells you how to munge your p values into a single combined p: https://en.wikipedia.org/wiki/Fisher%27s_method

Fisher's method makes the fairly strong assumption that your 4 tests are independent. If this assumption is violated you may end up rejecting the null too often. A simpler approach that can avoid this assumption might be to Z-score each of your 4 survey questions and then sum the 4 Z-scores for each survey respondent. You can then just do a regular t-test comparing the mean sum-of-Z-scores between the two groups. This should have the desired effect (i.e. an increase in the power of your test by combining the info in all 4 questions) without invoking any hairy statistics.
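Both combinations are packaged in scipy, for what it's worth (using the four p-values discussed in the post):

```python
# Fisher's method and Stouffer's Z on the four ambidexterity p-values.
from scipy.stats import combine_pvalues

pvals = [0.049, 0.008, 0.48, 0.052]   # the four survey questions

fisher_stat, p_fisher = combine_pvalues(pvals, method="fisher")
stouffer_z, p_stouffer = combine_pvalues(pvals, method="stouffer")
```

Stouffer's method, which sums one-sided Z-scores, is the closest packaged analogue of the sum-of-Z-scores t-test suggested above, though it combines p-values rather than raw responses, so it still inherits the independence worry.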

A few people seem to have picked up on some of the key issues here, but I'll reiterate.

1. The study should have randomized with constraints to match blood pressures between the groups. This is well established methodology.

2. Much of the key tension between the different examples is really about whether the tests are independent. Bayesianism, for example, is just a red herring here.

Consider trying the same intervention at 10 different hospitals, and all of them individually have an outcome of p=0.07 +/- 0.2 for the intervention to "work". In spite of several hospitals not meeting a significance threshold, that is very strong evidence that it does, in fact, work, and there are good statistical ways to handle this (e.g. regression over pooled data with a main effect and a hospital effect, or a multilevel model etc.). Tests that are highly correlated reinforce each other, and modeled correctly, that is what you see statistically. The analysis will give a credible interval or p-value or whatever you like that is much stronger than the p=0.05 results on the individual hospitals.
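A sketch of the pooled-regression idea on simulated data (the effect size, hospital shifts, and sample sizes are invented): one treatment coefficient estimated across all hospitals, with hospital dummies absorbing site differences.

```python
# Pooled regression with a main treatment effect and a hospital effect,
# fit with plain least squares on simulated data.
import numpy as np

rng = np.random.default_rng(0)
hospitals, n = 10, 60                  # 60 patients per hospital
effect = 0.5
ys, treats, hosps = [], [], []
for h in range(hospitals):
    treat = rng.permutation([0, 1] * (n // 2))          # balanced within site
    y = effect * treat + rng.normal(loc=h * 0.3, size=n)  # site-level shift
    ys.append(y); treats.append(treat); hosps.append(np.full(n, h))
y = np.concatenate(ys)
treat = np.concatenate(treats)
hosp = np.concatenate(hosps)

# design matrix: treatment + one dummy per hospital (absorbing the intercept)
X = np.column_stack([treat] + [(hosp == h).astype(float) for h in range(hospitals)])
beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = res[0] / (len(y) - X.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
t_stat = beta[0] / se                  # pooled t-statistic for the treatment
```

Each hospital alone gives a noisy estimate; pooling shrinks the standard error on the treatment coefficient by roughly sqrt(10), which is why ten marginal results can add up to strong evidence.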

On the other hand, experiments that are independent do not reinforce each other. If you test 20 completely unrelated treatments, and one comes up p=0.05, you should be suspicious indeed. This is the setting of most multiple comparisons techniques.

Things are tougher in the intermediate case. In general, I like to try to use methods that directly model the correlations between treatments, but this isn't always trivial.

One thing I'm wondering: is it possible to retroactively correct for the failure to stratify or match by randomly creating matched sub-samples, and resampling multiple times? Or does that introduce other problems.

That's not too far from what they did, by trying to control for BP. It's reasonable, but it's still much better to stratify your sampling unless your sample is so big that it doesn't matter.

There's a robust literature on post-sampling matching and weighting to control for failures in confounder balance. The simplest case is exact or 1-1 matching, where subjects in the treated and control sets with identical confounder vectors are "paired off," resulting in more balanced groups. A common tool here is propensity scores, which let you quantify how similar a treated versus control subject is with many covariates, or continuous covariates.

These kinds of techniques do change the populations you can make inferences about. What commonly happens is that you end up losing or downweighting untreated subjects (think large observational studies where most people are "untreated"), and so your inference is really only applicable to the treated population. But there are ways around it of course. If you're interested here's a good overview: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/

I don't see why the constraints are necessary. If you don't screw up the randomization (meaning the group partitions are actually equiprobable), the random variables describing any property of any member of one group are exactly the same as for the other group. Therefore, if you use correct statistical procedures (e.g., Fisher's exact test on the 2x2 table: given vitamin D / not given vitamin D and died / alive), your final p-value should already contain the probability that, for example, every high-blood-pressure person got into one group. If, instead, you use constraints on some property during randomization, who knows what that property might be correlated with? Which would, imo, weaken the result.

Here's why: if you force equally weighted grouping into strata (or matched pairs), then if the property is correlated with the outcome, you'll have a *better* or no worse understanding of the effect than if it's randomly selected for. If it's uncorrelated with the outcome, it does nothing. But if you ignore the property and it's correlated, you may get a lopsided assignment on that characteristic.

JASSCC pretty much got this right. It generally hurts nothing to enforce balance in the randomization across a few important variables, but it reduces noise. That's why it's standard in so many trials. Consider a stylized example of a trial you want to balance by sex. In an unbalanced trial, you would just flip a coin for everyone independently, and run the risk of a big imbalance happening by chance. Or, you can select a random man for the control group, then one for the treatment. Then select a random woman for control, then one for treatment, etc. Everything is still randomized, but all the populations are balanced and you have removed a key source of noise.

On the 'what could go wrong' point: what could go wrong is that in ensuring your observable characteristics are nicely balanced, you've imported an assumption about their relation to characteristics you cannot observe. You're saying that the subset of possible draws in which observables are balanced is also the subset in which unobservables are balanced, which is much stronger than the traditional conditional independence assumption.

I think that some of the confusion here is on the difference between the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). The first is the probability under the null hypothesis of getting at least one false positive. The second is the expected proportion of false positives over all positives. Assuming some level of independent noise, we should expect the FWER to increase with the number of tests if the power of our test is kept constant (this is not difficult to prove in a number of settings using standard concentration inequalities). The FDR, however, we can better control. Intuitively this is because one "bad" apple will not ruin the barrel as the expectation will be relatively insensitive to a single test as the number of tests gets large.

Contrary to what Scott claims in the post, Holm-Bonferroni does *not* require independence of tests because its proof is a simple union bound. I saw in another comment that someone mentioned the Benjamini-Hochberg rule as an alternative. This *does* (sometimes) require independent tests (more on this below) and bounds the FDR instead of the FWER. One could use the Benjamini-Yekutieli rule (http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_yekutieli_ANNSTAT2001.pdf) that again bounds the FDR but does *not* require independent tests. In this case, however, this is likely not powerful enough as it bounds the FDR in general, even in the presence of negatively correlated hypotheses.

To expand on the Benjamini-Hochberg test, we actually do not need independence, and a condition in the paper I linked above suffices (actually, a weaker condition of Lehmann's suffices). Thus we *can* apply Benjamini-Hochberg to Scott's example, assuming that we have this positive dependency. Thus, suppose that we have 99 tests with a p-value of .04 and one with a p-value of .06. Then applying Benjamini-Hochberg would tell us that we can reject the first 80 tests with a FDR bounded by .05. This seems to match Scott's (and my) intuition that trying to test the same hypothesis in multiple different ways should not hurt our ability to measure an effect.

Sorry, I made a mistake with my math. I meant to say in the last paragraph that we would reject the first 99 tests under the BH adjustment and maintain a FDR of .05.
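The step-up rule is short enough to write out directly; on the 99-at-0.04 example it indeed rejects the first 99 (assuming the positive dependence needed for the FDR guarantee):

```python
# Benjamini-Hochberg step-up rule: find the largest rank k whose ordered
# p-value clears q*k/m, and reject everything up to that rank.
def benjamini_hochberg(pvals, q=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])        # indices of rejected hypotheses

pvals = [0.04] * 99 + [0.06]
rejected = benjamini_hochberg(pvals)
```

At rank 99 the threshold is 0.05 * 99/100 = 0.0495 >= 0.04, so all ninety-nine 0.04s are rejected; the 0.06 at rank 100 misses its 0.05 threshold.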

It's true that Bonferroni doesn't require independence. However, in the absence of independence it can be very conservative. Imagine if you ran 10 tests that were all almost perfectly correlated. You then use p < 0.005 as your corrected version of p < 0.05. You're still controlling the error rate – you will have no more than a 5% chance of a false positive. In fact, much less! But you will make many more type II errors, because the true p value threshold should be 0.05.
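A tiny simulation of that point: ten perfectly correlated "tests" are really one test, so the Bonferroni threshold of 0.05/10 drives the realized false-positive rate far below the 5% you were willing to accept.

```python
# Ten perfectly correlated tests under the null: Bonferroni controls the
# FWER but is wildly conservative, since the ten tests are one test.
import random

random.seed(0)
trials, m, alpha = 20000, 10, 0.05
fp_bonferroni = fp_single = 0
for _ in range(trials):
    p = random.random()                  # one p-value under the null...
    ps = [p] * m                         # ...duplicated into 10 "tests"
    fp_bonferroni += any(q < alpha / m for q in ps)
    fp_single += p < alpha

rate_bonferroni = fp_bonferroni / trials   # roughly alpha/m, not alpha
rate_single = fp_single / trials           # roughly alpha
```

The flip side of the suppressed false-positive rate is the extra type II errors described above: real effects near p = 0.05 never get called.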

Yes this is true. First of all, the Holm-Bonferroni adjustment is uniformly more powerful than the Bonferroni correction and has no assumption on the dependence structure of the tests, so Bonferroni alone is always too conservative. Second of all, this is kind of the point: there is no free lunch. If you want a test that can take advantage of a positive dependence structure and not lose too much power, after a certain point, you are going to have to pay for it by either reducing the generality of the test or weakening the criterion used to define failure. There are some options that choose the former (like Sidak's method) and others that choose the latter, by bounding FDR instead of FWER (still others do both, like the BH method discussed above).

I agree with your post, and judging from where the p-values probably came from I agree they are probably positively dependent.

FYI there is a new version of the BH procedure which adjusts for known dependence structure. It doesn’t require positive dependence but is uniformly more powerful than BH under positive dependence:

I work in a big corp which makes money by showing ads. We run thousands of A/B tests yearly, and here are our standard ways to deal with such problems:

1. If you suspect beforehand that there will be multiple significant hypotheses, run the experiment with two control groups, i.e. an AAB experiment. Then disregard all alternatives which aren't significant in both comparisons A1B and A2B.

2. If you run an AB experiment and have multiple significant hypotheses, rerun the experiment and only pay attention to hypotheses which were significant in the previous experiment.

I am not a statistician, so I'm unsure if this is formally correct.

I work in a small corp which makes money by showing ads. No, your methods are not formally correct. Specifically, for #1, you're just going halfway to an ordinary replication, and thus only "adjusting" the significance threshold by some factor which may-or-may-not be enough to account for the fact that you're doing multiple comparisons, and for #2, you're just running multiple comparisons twice, with the latter experiment having greater power for each hypothesis tested thanks to the previous adjusted-for-multiple-comparisons prior.

All that is to say- the math is complicated, and your two methods *strengthen* the conclusions, but they by no means ensure that you are reaching any particular significance level or likelihood ratio or whatever, since that would depend on how many hypotheses, how large the samples, and so on.

Thanks for the clarification. Do you have some special people who make decisions on A/B tests? If not, I would be really interested in hearing your standard practices for A/B testing.

Generally, the decisions are "pre-registered" - we've defined the conditions under which the test will succeed or fail, and we know what we'll do in response to either state, so no one is really making a decision during/after the test. All of the decisions come during test design - we do have a data scientist on staff who helps on larger or more complicated designs, but generally we're running quite simple A/B tests: a single defined goal metric, some form of randomization between control and test (which can get slightly complicated for reasons pointed out in the OP!), a power calculation done in advance to ensure the test is worth running, and an evaluation after some pre-defined period.

Usually we do replications only when A) requested, or B) enough time has passed that we suspect the underlying conditions might change the outcome, or C) the test period showed highly unusual behavior compared to "ordinary" periods.

We've been considering moving to a more overtly Bayesian approach to interpreting test results and deciding upon action thresholds (as opposed to significance-testing), but haven't done so yet.

Can I ask why the big corp doesn't employ a statistician/similar to do this? I'm not trying to be snarky, but if they're investing heavily in these experiments it seems weird to take such a loose approach.

It does employ several analysts who, as far as I know, have at least passable knowledge of statistics. As for why our A/B testing isn't formally correct, I have several hypotheses:

1. A lot of redundancy - we have literally thousands of metrics in each A/B test, so any reasonable hypothesis would involve at least 3-5 metrics - therefore we're unlikely to get a false positive. If, on the other hand, an A/B test shows non-significant results for something we believe in, we will often redo the experiment, which decreases the chance of a false negative. So from a Bayesian point of view we are kind of alright - we just inject some of our beliefs into the A/B tests.

2. It's really hard to detect statistical problems in human decisions (all A/B tests have human-written verdicts) because the majority of human decisions have multiple justifications. Furthermore, it's even harder to calculate the damage from subtly wrong decision-making - would we redo an experiment if the results were a bit different? Did it cost us anything if we mistakenly approved design A, which isn't really different from design B? A lot of legibility problems here.

3. Deployment of analysts is very skewed by department - I know that some departments have 2-3x more analysts than we have. Maybe in such departments the culture of experimentation is better.

>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant. But I can construct another common-sensically significant version that it wouldn't find significant - in fact, I think all you need to do is have ninety-nine experiments come back p = 0.04 and one come back 0.06.

About the Holm-Bonferroni method:

How it works, is that you order the p-values from smallest to largest, and then compute a threshold for significance for each position in the ranking. The threshold formula is: α / (number of tests – rank + 1), where α is typically 0.05.

Then the p-values are compared to the threshold, in order. If the p-value is less than the threshold the null hypothesis is rejected. As soon as one is above the threshold, that one, and all subsequent p-values in the list, fail to reject the null hypothesis.

So for your example of 100 tests where one is 0.06 and others are all 0.04, it would come out to:

Rank   p      Threshold
1      0.04   0.05 / 100 = 0.0005
2      0.04   0.05 / 99  = 0.00051
...
100    0.06   0.05 / 1   = 0.05

So you're right, none of those would be considered "significant". But you'd have to be in some pretty weird circumstances to have almost all your p-values be 0.04.
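The procedure described above, written out as code; on the 99 x 0.04 plus one 0.06 example it rejects nothing, matching the hand calculation:

```python
# Holm-Bonferroni step-down rule: order p-values ascending, compare each to
# alpha / (m - rank + 1), and stop at the first one that fails.
def holm_bonferroni(pvals, alpha=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = []
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha / (m - rank + 1):
            rejected.append(i)
        else:
            break                  # first failure stops the whole procedure
    return rejected

rejected = holm_bonferroni([0.04] * 99 + [0.06])
```

The very first comparison is 0.04 against 0.05/100 = 0.0005, so the procedure stops immediately and rejects nothing.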

The big concept here, is that this protocol controls the *familywise error rate*, which is the probability that at least one of your rejections of the null is incorrect. So it makes perfect sense that as you perform more tests, each test has to pass a stricter threshold.

What you are actually looking for is the *false discovery rate* which is the expected fraction of your rejections of the null that are incorrect. There are other formulas to calculate this. https://en.wikipedia.org/wiki/False_discovery_rate

I also forgot to mention, that these multiple hypothesis correction formulas assume that the experiments are **independent**. This assumption would be violated in your thought experiment of measuring the same thing in 100 different ways.

They actually just assume positive dependence, not independence. For example, if the family of tests is jointly Gaussian and the tests are all nonnegatively correlated, then Theorem 1.2 and Section 3.1, case 1 of http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_yekutieli_ANNSTAT2001.pdf imply that we can still use Benjamini-Hochberg and bound the FDR. In particular, this should hold if the tests are all measuring the same thing, as in Scott's example.

"by analogy, suppose you were studying whether exercise prevented lung cancer. You tried very hard to randomize your two groups, but it turned out by freak coincidence the "exercise" group was 100% nonsmokers, and the "no exercise" group was 100% smokers."

But don't the traditionalists say that this is a feature, not a bug, of randomization? That if unlikely patterns appear through random distribution this is merely mirroring the potential for such seemingly nonrandom grouping in real life? I mean this is obviously a very extreme example for argumentative purposes, but I've heard people who are really informed about statistics (unlike me) say that when you get unexpected patterns from genuinely randomization, hey, that's randomization.

About p thresholds, you've pretty much nailed it by saying that simple division works only if the tests are independent. And that is pretty much the same reason why the 1:1 likelihood ratio can't be simply multiplied by the others and give the posterior odds. This works only if the evidence you get from the different questions is independent (see Jaynes's PT:LoS chap. 4 for reference)

What if there was a standard that after you randomize, you try to predict as well as possible which group is more likely to naturally perform better, and then you make *that* the treatment group? Still flawed, but feels like a way to avoid multiple rolls while also minimizing the chance of a false positive (of course assuming avoiding false positives is more important than avoiding false negatives).

Did you get this backwards? It seems like if you're trying to minimize the false positive, you'd make the *control group* the one with a better "natural" performance result. That being said, in most of these sorts of trials, it's unclear of the helpfulness of many of the potential confounders.

Yep... Thanks for pointing that out, flip what I said.

Re:unclear: I guess I'm picturing that if people are gonna complain after the fact, you'd hope it would have been clear before the fact? If people are just trying to complain then that doesn't work, but if we assume good faith it seems plausible? Maybe you could make it part of preregistration (the kind where you get peer reviewed before you run the study), where you describe the algorithm you'll use to pick which group should be the control group, and the reviewers can decide if they agree with the weightings. Once you get into territory where it isn't clear, I'd hope the after-the-fact-complaints should be fairly mild?

The Bonferroni correction has always bugged me philosophically for reasons vaguely similar to all this. Merely reslicing and dicing the data ought not, I don’t think, ruin the chances for any one result to be significant. But then again I believe what the p-hacking people all complain about, so maybe we should just agree that 1/20 chance is too common to be treated as unimpeachable scientific truth!

Hm. I think to avoid the first problem, divide your sample into four groups, and do the experiment twice, two groups at a time. If you check for 100 confounders, you get an average of 5 in each group, but an average 0.25 confounders in both, so with any luck you can get the statistics to tell you whether any of the confounders made a difference (although if you didn't increase N you might not have a large enough experiment in each half).

The Bayesian analysis only really works if you have multiple *independent* tests of the same hypothesis. For example, if you ran your four tests using different survey populations it might be reasonable. However, since you used the same survey results you should expect the results to be correlated and thus multiplying the confidence levels is invalid.

As an extreme version of this, suppose that you just ran the *same* test 10 times, and got a 2:1 confidence level. You cannot conclude that this is actually 1000:1 just by repeating the same numbers 10 times.

Combining data from multiple experiments testing the same hypothesis to generate a more powerful test is a standard meta-analysis. Assuming the experiments are sufficiently similar, you effectively paste all the data sets together and then calculate the p value for the combined data set. With a bit of maths you can do this using only the effect sizes (d) and standard errors (s) from each experiment (you create 1/s^2 copies of d for each experiment and then run the p value).

The reason none of the other commenters have suggested this as a solution to your ambidexterity problem (being ambidextrous isn't a problem! Ahem, anyway) is that you haven't given enough data - just the p values instead of the effect sizes and standard errors. I tried to get these from the link on the ambidexterity post to the survey, but it gave me separate pretty graphs instead of linked data I could run the analysis on.

However, I can still help you by making an assumption: assuming that the standard error across all 4 questions is the same (s), since they come from the same questionnaire and therefore likely had similar numbers of responders, we can convert the p values into effect sizes - the differences (d) - using the inverse normal function:

1. p = 0.049 => d = 1.65s

2. p = 0.008 => d = 2.41s

3. p = 0.48 => d = 0.05s

4. p = 0.052 => d = 1.63s

We then average these to get the combined effect size d_c = 1.43s. However all the extra data has reduced the standard error of our combined data set. We're assuming all experiments had the same error, so this is just like taking an average of 4 data points from the same distribution - i.e. the standard error is divided by sqrt(n). In this case we have 4 experiments, so the combined error (s_c) is half what we started with. Our combined measure is therefore 1.43s/0.5s = 2.86 standard deviations above our mean => combined p value of 0.002

Now this whole analysis depends on us asserting that these 4 experiments are basically repeats of the same experiment. Under that assumption, you should be very sure of your effect!

If the different experiments had different errors, we would take a weighted average instead, with the lower-error experiments getting higher weightings (1/s^2 - the inverse of the squared standard error, as this recovers the number of data points that experiment contributes to our 'giant meta data set'), and similarly compute a combined standard error of sqrt(1/sum(1/s^2)).
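A quick sanity check on the arithmetic above, treating the p-values as one-sided (which is what the inverse-normal conversion assumes):

```python
from statistics import NormalDist

norm = NormalDist()

# One-sided p-values from the four ambidexterity questions.
p_values = [0.049, 0.008, 0.48, 0.052]

# Inverse normal converts each p-value to an effect size in units of the
# shared standard error s: d_i / s.
z_scores = [norm.inv_cdf(1 - p) for p in p_values]

# Unweighted average effect (still in units of s).
d_combined = sum(z_scores) / len(z_scores)      # ~1.43

# Averaging 4 equal-error estimates halves the standard error,
# so the combined z is d_c / (s/2) = 2 * d_c / s.
z_combined = 2 * d_combined                     # ~2.87

p_combined = 1 - norm.cdf(z_combined)           # ~0.002
```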

Blast. You're quite right. This problem is called pseudoreplication. My limited understanding is that our options are:

1. Average the answers people gave to each of the 4 questions to create one, hopefully less noisy, score of 'authoritarian-ness' for each real replicate (independent person answering the questionnaire), and then run the standard significance test on that.

2. Develop a hierarchical model which describes the form of the dependence between the different answers, and then perform a marginally more powerful analysis on this.

Yes. But Scott should be able to estimate the covariance: I am guessing these 4 p-values are all coming from two sample t-tests on four correlated responses. The correlations between the differences between the four means should be* the same as the individual-level within-sample correlations of the four responses, which Scott can calculate with access to the individual-level data.

I mean the right thing to do is probably to combine all four tests into one big test. Perhaps the most general thing you could do is to treat the vector of empirical means as a sample from a 4-d Gaussian (that you could approximate the covariance of), and run some kind of multi-dimensional T-test on.

Do you know what the right test statistic would be though? If your test statistic is something like the sum of the means I think your test would end up being basically equivalent to Neil's option 1. If your test statistic was the maximum of the component means, I guess you could produce the appropriate test, but it'd be complicated. Assuming that you had the covariance matrix correct (doing an analogue of a z-test rather than a t-test), you'd have to perform a complicated integral in order to determine the probability that a Gaussian with that covariance matrix produces a coordinate whose mean is X larger than the true mean. If you wanted to do the correct t-test analogue, I'm not even sure how to do it.

Monte Carlo should be a good solution for either thing you want to do. Nonparametric bootstrap is also probably a bit better than the Gaussian approximation.

I don’t think it will be roughly equivalent to treating them as independent. The average of 4 correlated sample means may have a much larger variance than the average of 4 independent ones.

Not equivalent to treating them as independent. Equivalent to Neil's revision where he talks about averaging scores.

I don't think Monte Carlo works for our t-test analogue (though it works to perform the integral for the z-test). The problem is that we want to know that *whatever* the ground truth correlation matrix, we don't reject with probability more than 0.05 if the true mean is 0. Now for any given correlation matrix, we could test this empirically with Monte Carlo, but finding the worst case correlation matrix might be tricky.

Ah, I might not be following all the different proposals; sorry if I missed something.

Unfortunately with global mean testing there isn’t a single best choice of test stat, but in this problem “average of the 4 se-standardized mean shifts” seems as reasonable a combined test stat as anything else.

Regarding “what is the worst case correlation matrix” I was assuming Scott would just estimate the correlation matrix from the individual level data and plug it in as an asymptotically consistent estimate. Whether or not that’s reasonable depends on the sample size, which I don’t know.

If the sample size isn’t “plenty big” then some version of the bootstrap (say, iid but stratified by ambi and non-ambi) should also work pretty well to estimate the standard error of whatever test statistic he wants to use. Have to set it up the right way, but that is a bit technical for a reply thread.

Permutation test will also work in small samples as a last resort but would not give specific inference about mean shifts.

I wouldn't worry too much about the vitamin D study.

For one thing, it is perfectly statistically correct to run the study without doing all of these extra tests, and as you point out, throwing out the study if any of these tests comes back with p-value less than 0.05 would basically mean the study can never be done.

Part of the point of randomizing is that you would expect any confounding factors to average out. And sure, you got unlucky and people with high blood pressure ended up unevenly distributed. On the other hand, Previous lung disease and Previous cardiovascular disease ended up pretty unbalanced in the opposite direction (p very close to 1). If you run these numbers over many different possible confounders, you'd expect the effects to roughly average out.

I feel like this kind of analysis is only really useful to

A) Make sure that your randomization isn't broken somehow.

B) If the study is later proved to be wrong, it helps you investigate why it might have been wrong.

"I think the problem is that these corrections are for independent hypotheses, and I'm talking about testing the same hypothesis multiple ways (where each way adds some noise). "

If you are testing the same hypothesis multiple ways, then your four variables should be correlated. In that case you can extract one composite variable from the four with a factor analysis, et voilà - just one test to perform!

"Maybe I need a real hypothesis, like "there will be a difference of 5%", and then compare how that vs. the null does on each test? But now we're getting a lot more complicated than just the "call your NHST result a Bayes factor, it'll be fine!" I was promised."

Ideally what you need is a prior distribution over possible hypotheses. For example, you might think before any analysis of the data that there is a 90% chance of no effect, and that if there is an effect you expect its size to be normally distributed about 0 with an SD of 1. Then your prior distribution on the effect size x is f(x) = 0.9*delta(0) + 0.1*N(0,1), and if p(x) is the probability of your observation given an effect size of x, you can calculate the posterior distribution of effect size as f(x)*p(x)/int(f(x)*p(x)dx).
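For this particular mixture prior the integral is available in closed form: if the observation is an effect estimate d with standard error s, the marginal likelihood under the N(0,1) slab is just the N(0, sqrt(1+s^2)) density at d. A sketch with assumed numbers (d = 2, s = 1, i.e. a "2 sigma" result):

```python
from statistics import NormalDist

d, s = 2.0, 1.0                     # assumed observed effect and standard error

prior_null = 0.9                    # P(no effect), as in the example above

# Likelihood of the observation if the effect is exactly 0:
like_null = NormalDist(0, s).pdf(d)
# Under the slab, effect ~ N(0, 1), so marginally d ~ N(0, sqrt(1 + s^2)):
like_slab = NormalDist(0, (1 + s**2) ** 0.5).pdf(d)

posterior_null = (prior_null * like_null) / (
    prior_null * like_null + (1 - prior_null) * like_slab
)
# ~0.82: a lone "2 sigma" result barely moves a 90% prior on "no effect".
```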

If the express goal of the "randomization" is for the resulting groups to be almost equal on every conceivable axis, and any deviation from equal easily invalidates any conclusion you might want to draw... is randomization maybe the worst tool to use here?

Would a constraint-solving algorithm not fare much, much better - being fed the data, dividing it 10,000 different ways, and then randomizing within the few divisions that are closest to equal on every axis?

I hear your cries: malpractice, (un)conscious tampering, incomplete data, bugs... But a selection process that reliably kills a significant portion of all research despite best efforts is hugely wasteful.

How large is that portion? Cut each bell curve (one per axis) down to the acceptable peak in the middle (= equal enough along that axis) and take that fraction to the power of the number of axes. That's a lot of avoidably useless studies!

Clinical trial patients arrive sequentially or in small batches; it's not practical to gather up all of the participants and then use an algorithm to balance them all at once.

This paper discusses more or less this issue with potentially imbalanced randomization from a Bayesian decision theory perspective. The key point is that, for a researcher that wants to minimize expected loss (as in, you get payoff of 0 if you draw the right conclusion, and -1 if you draw the wrong one), there is, in general, one optimal allocation of units to treatment and control. Randomization only guarantees groups are identical in expected value ex-ante, not ex-post.

You don't want to do your randomization and find that you are in the 1% state of world where there is wild imbalance. Like you said, if you keep re-doing your randomization until everything looks balanced, that's not quite random after all. This paper says you should bite the bullet and just find the best possible treatment assignment based on characteristics you can observe and some prior about how they relate to the outcome you care about. Once you have the best possible balance between two groups, you can flip a coin to decide what is the control and what is the treatment. What matters is not that the assignment was random per se, but that it is unrelated to potential outcomes.

I don't know a lot about the details of medical RCTs, but in economics it's quite common to do stratified randomization, where you separate units in groups with similar characteristics and randomize within the strata. This is essentially taking that to the logical conclusion.

1. P-values should not be presented for baseline variables in an RCT. It is just plain illogical. We are randomly drawing two groups from the same population - how could there be a systematic difference? In the words of Doug Altman: "Performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance."

Regardless, it is wrong to use p-values to choose what variables to use in the model. It is really very straightforward: just decide a priori what variables are known (or highly likely) predictors and put them in the model. Stephen Senn: "Identify useful prognostic covariates before unblinding the data. Say you will adjust for them in the statistical analysis plan. Then do so." (https://www.appliedclinicaltrialsonline.com/view/well-adjusted-statistician-analysis-covariance-explained)

2. As others have pointed out, Bayes factors will quantify the evidence provided by the data for two competing hypotheses. Commonly a point null hypothesis and a directional alternative hypothesis ("the effect is larger/smaller than 0"). A "negative" result would not be 1:1 - that is a formidably inconclusive result. Negative would be e.g. 1:10 vs. positive 10:1.

1. Exactly. The tests reported in Scott's first table above are of hypotheses that there are differences between *population* treatment groups. They only make sense if we randomized the entire population and then sampled from these groups for our study. Which we don't. I typically report standardized differences for each factor, and leave it at that.

Regarding part 2, when your tests are positively correlated, there are clever tricks you can do with permutation tests. You resample the data by randomly permuting your dependent variable, run your tests, and collect the p-values. If you do this many times, you get a distribution of the p-values under the null hypothesis, and you can compare your actual p-values to this. Typically, for multiple tests, you use the maximum of the set of statistics. The reference is Westfall, P.H. and Young, S.S., 1993. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment (Vol. 279). John Wiley & Sons.
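A sketch of that max-statistic permutation idea with made-up data (two correlated endpoints, absolute difference in group means as the per-endpoint statistic):

```python
import random
from statistics import mean

random.seed(1)

n = 60
group = [i < 30 for i in range(n)]          # True = treatment
# Two endpoints sharing a common noise term, so they are correlated;
# a modest treatment effect is planted on the first endpoint only.
shared = [random.gauss(0, 1) for _ in range(n)]
y1 = [s + random.gauss(0, 0.5) + (0.8 if g else 0.0)
      for s, g in zip(shared, group)]
y2 = [s + random.gauss(0, 0.5) for s in shared]

def stats(labels):
    out = []
    for ys in (y1, y2):
        treated = [y for y, t in zip(ys, labels) if t]
        rest = [y for y, t in zip(ys, labels) if not t]
        out.append(abs(mean(treated) - mean(rest)))
    return out

observed = stats(group)

# Permute the labels; record the MAX statistic over endpoints each time.
# Comparing every observed statistic to this max-distribution adjusts for
# multiplicity while automatically respecting the correlation structure.
max_null = [max(stats(random.sample(group, n))) for _ in range(2000)]

adj_p = [sum(m >= o for m in max_null) / len(max_null) for o in observed]
```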

This is the kind of approach I would take for tests that I expect to be correlated. Or possibly attempt to model the correlation structure among the endpoints, if I wanted to do something Bayesian.

You can't run a z-test and treat "fail to reject the null" as the same as "both are equivalent". The problem with that study is that they didn't have nearly enough power to say that the groups were "not significantly different", and relying on a statistical cutoff for each of those comparisons instead of thinking through the actual problems makes this paper useless, in my view.

Worst example: If you have 8 total patients with diabetes (5 in the smaller group and 3 in the larger) you are saying that a 3.5 fold difference in diabetes incidence rate (accounting for the group sizes) is not significant. Obviously that is the kind of thing that can only happen if you are relying on p-values to do your thinking for you, as no reasonable person would consider those groups equivalent. It's extra problematic because, coincidentally, all of those errors happened to point in the same direction (more comorbidities in the control group). This is more of the "shit happens" category of problem though, rather than a flaw in the study design.

There are lots of ways they could have accounted for differences in the study populations, which I expect they tried and it erased the effect. Regressing on a single variable (blood pressure) doesn't count... you would need to include all of them in your regression. The only reason this passed review IMO is because it was "randomized" and ideally that should take care of this kind of issue. But for this study (and many small studies) randomization won't be enough.

I put this paper in the same category of papers that created the replication crisis in social psychology: "following all the rules" and pretending that any effect you find is meaningful as long as it crosses (or in this case, doesn't cross) the magical p=0.05 threshold. The best response to such a paper should be something like "well that is almost certainly nothing, but I'd be slightly more interested in seeing a follow-up study"

You need to specify in advance which characteristics you care about ensuring proper randomization and then do block randomization on those characteristics.

There are limits to the number of characteristics you can do this for, with a given sample size / effect size power calculation.

This logic actually says that you *can* re-roll the randomization if you get a bad one --- in fact, it says you *must* do this, that certain "randomizations" are much better than other ones because they ensure balance on characteristics that you're sure you want balance on.

>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant.

Unfortunately, the Holm-Bonferroni method doesn't work that way. It always requires at least one p-value that is significant in the multiple-comparisons sense, i.e. at least one p-value less than 0.05/n, where n is the number of comparisons.

Its strength is that it doesn't require all the p-values to clear that bar. Holm-Bonferroni tests the sorted p-values step-down against the thresholds 0.05/n, 0.05/(n-1), ..., 0.05. So if you have five p-values 0.01, 0.012, 0.016, 0.025, 0.05, all five are significant under Holm-Bonferroni, whereas the standard multiple-comparisons (Bonferroni) correction, 0.05/5 = 0.01 for every test, would declare only the first significant.
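For concreteness, a minimal implementation of the step-down procedure (my own illustration, not from the thread):

```python
def holm(p_values, alpha=0.05):
    """Holm-Bonferroni: step down through the sorted p-values,
    comparing the k-th smallest against alpha / (n - k + 1)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    rejected = [False] * n
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (n - rank):
            rejected[i] = True
        else:
            break   # once one test fails, all larger p-values fail too
    return rejected

holm([0.01, 0.012, 0.016, 0.025, 0.05])   # all five rejected
holm([0.01, 0.02, 0.03, 0.04, 0.05])      # only the first survives
```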

I don't claim to have any serious statistical knowledge here, but my intuitive answer is that expected evidence should be conserved.

If you believe that vitamin D reduces COVID deaths, you should expect to see a reduction in deaths in the overall group. It should be statistically significant, but that's effectively a way of saying that you should be pretty sure you're really seeing it.

If you expect there to be an overall difference, then either you expect that you should see it in most ways you could slice up the data, or you expect that the effect will be not-clearly-present in some groups but very-clearly-present in others, so that there's a clear effect overall. I think the latter case means _something_ like "some subgroups will not be significant at p < .05, but other subgroups will be significant at p < (.05 / number of subgroups)". If you pick subgroups _after_ seeing the data, your statistical analysis no longer reflects the expectation you had before doing the test.

For questions like "does ambidexterity reduce authoritarianism", you're not picking a single metric and dividing it across groups - you're picking different ways to operationalize a vague hypothesis and looking at each of them on the same group. But I think that the logic here is basically the same: if your hypothesis is about an effect on "authoritarianism", and you think that all the things you're measuring stem from or are aspects of "authoritarianism", you should either expect that you'll see an effect on each one (e.g. p = .04 on each of four measures), or that one of them will show a strong enough effect that you'll still be right about the overall impact (e.g. p = .01 on one of four measures).

For people who are giving more mathematical answers: does this intuitive description match the logic of the statistical techniques?

"Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it? But it seems harsh to force them to perform a study even though we know we'll dismiss the results as soon as we get them. If we made them check a pre-written list of confounders and re-roll until there were no significant differences on any of them, what could go wrong? I don't have a good answer to this question, but thinking about it still creeps me out."

The orthodox solution here is stratified random sampling, and it is fairly similar. For example, you might have a list of 2,500 men and 2,500 women (assume no NBs). You want to sample 500, 250 for control and 250 for intervention, and you expect that gender might be a confound. In stratified random sampling, instead of sampling 250 of each and shrugging if there's a gender difference, you choose 125 men for the control and 125 men for the intervention, and do the same for women. This way you are certain to get a balanced sample (https://www.investopedia.com/terms/stratified_random_sampling.asp, for example). While this only works with categorical data, you can always just bin continuous data until it cooperates.

The procedure is statistically sound, well-validated, and commonly practiced.

I know of one randomized experiment which matched people into pairs which were as similar as possible, then for each pair chose randomly which was treatment and which was control.

>I chose four questions that I thought were related to authoritarianism

What you can do if you're trying to measure a single thing (authoritarianism) and have multiple proxies, is to average the proxies to get a single measure, then calculate the p-value using the average measure only. I'd recommend doing that now (even though, ideally, you'd have committed to doing that ahead of time).

For the Vitamin D study, I think you are off track, as were the investigators. Trying to assess the randomness of assignment ex post is pretty unhelpful. The proper approach is to identify important confounders ahead of time and use blocking. For example, suppose that blood pressure is an important confounder. If your total n is 160, you might divide those 160 into four blocks of 40 each, based on blood pressure range: a very high block, a high block, a low block, and a very low block. Then randomly assign half within each block to get the treatment, so 20 very-high-blood-pressure folks get the vitamin D and 20 do not. That way you know that blood pressure variation across subjects won't mess up the study. If there are other important confounders, you have to subdivide further. If there are so many important confounders that the blocks get too small to have any power, then you needed a much bigger sample in the first place.
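The blocking scheme described above can be sketched as follows (the subjects and blood-pressure values are invented):

```python
import random

random.seed(2)

# Hypothetical subjects: (id, systolic blood pressure).
subjects = [(i, random.gauss(130, 20)) for i in range(160)]

# Rank by blood pressure and cut into four blocks of 40:
# very low, low, high, very high.
ranked = sorted(subjects, key=lambda s: s[1])
blocks = [ranked[i:i + 40] for i in range(0, 160, 40)]

treatment, control = [], []
for block in blocks:
    shuffled = random.sample(block, len(block))
    treatment += shuffled[:20]   # exactly half of each block gets vitamin D
    control += shuffled[20:]
```

By construction, each arm contains exactly 20 subjects from every blood-pressure stratum, so blood pressure cannot end up imbalanced between arms the way it did in the actual study.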

You would want to have a data-driven method to identify the confounders that you will use to block. Otherwise, you are embedding your intuition into the process.

I don't disagree, but that means you do two sets of analysis. One to identify the confounders and another to test the hypothesis (unless I have misunderstood). Reminiscent of the training / validation approach used for ML, don't you think?

But you do not learn about the possible confounders by doing the study. You learn about them ahead of time from other research, e.g. research that shows that blood pressure interacts strongly with COVID. Blocking can be data-driven in the sense that you have data from other studies showing that blood pressure matters. If you need to do a study to learn what the confounders are, then you need to do another study after that, based on that information.

And, of course, the larger the sample, the less you have to use advanced statistics to detect an effect.

Another alternative: replicate your results. The first test, be sloppy, don't correct for multiple comparisons, just see what the data seems to say, and formulate your hypothesis(es). Then test that/those hypothesis(es) rigourously with the second test.

You can even break your single database into two random pieces, and use one to formulate the hypothesis(es) and the other to test it/them.

Significance doesn't seem like the right test here. When you are testing for significance, you are asking a question about how likely it is that differences in the sample represent differences in the wider population (more or less, technically you are asking for frequentist statistics "if I drew a sample and this variable was random, what is the chance I would see a difference at least this large"). In this case, we don't care about that question, we care if the actual difference between two groups is large enough to cause something correlated with it to show through. At the very least, the Bonferroni adjustment doesn't apply, in fact I would go in the other direction. The difference needs to be big enough that it has some correlation with the outcome strong enough to cause a spurious result.

The key point of randomized trials is NOT that they ensure balance of each possible covariate. The key point is that they make the combined effect of all imbalances zero in expectation, and they allow the statistician to estimate the variance of their treatment-effect estimator.

Put another way, randomization launders what would be BIAS (systematic error due to imbalance) into mere VARIANCE (random error due to imbalance). That does not make balance irrelevant, but it subtly changes why we want balance — to minimize variance, not because of worries about bias. If we have a load of imbalance after randomizing, we'll simply get a noisier treatment-effect estimate.

"If the groups are different to start with, then we won't be able to tell if the Vitamin D did anything or if it was just the pre-existing difference."

Mmmmmaybe. If the groups are different to start with, you get a noisier treatment-effect estimate, which MIGHT be so noisy that you can't reject the vitamin-D-did-nothing hypothesis. Or, if the covariates are irrelevant, they don't matter and everything turns out fine. Or, if vitamin D is JUST THAT AWESOME, the treatment effect will swamp the imbalance's net effect anyway. You can just run the statistics and see what numbers pop out at the end.

"Or to put it another way - perhaps correcting for multiple comparisons proves that nobody screwed up the randomization of this study; there wasn't malfeasance involved. But that's only of interest to the Cordoba Hospital HR department when deciding whether to fire the investigators."

No. It's of interest to us because if we decide that the randomization was defective, all bets are off; we can't trust how the study was reported and we don't know how the investigators might have (even if accidentally) put their thumb on the scale. If we instead convince ourselves that the randomization was OK and the study run as claimed, we're good to apply our usual statistical machinery for RCTs, imbalance or no.

"But this raises a bigger issue - every randomized trial will have this problem. [...] Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; [...] if you're not going to adjust these away and ignore them, don't you have to throw out every study?"

No. By not adjusting, you don't irretrievably damage your randomized trials, you just expand the standard errors of your final results.

Basically, if you really and truly think that a trial was properly randomized, all an imbalance does is bleed the trial of some of its statistical power.

Regarding whether hypertension explains away the results, have not read the papers (maybe Jungreis/Kellis did something strictly better), but here's a simple calculation that sheds some light I think:

So 11 out of the 50 treated patients had hypertension. 39 don't.

And 15 out of the 26 control patients had hypertension. 11 don't.

You know that a total of 1 vitamin D patient was ICU'd. And 13 of the control patients were admitted.

There is no way to slice this such that the treatment effect disappears completely [I say this, having not done the calculation I have in mind to check it -- will post this regardless of what comes out of the calculation, in the interest of pre-registering and all that]

To check this, let's imagine that you were doing a stratified study, where you're testing the following 2 hypotheses simultaneously:

H1: Vitamin D reduces ICU rate among hypertension patients

H2: Vitamin D reduces ICU rate among non-hypertension patients.

Your statistical procedure is to

(i) conduct a Fisher exact test on the 11 [treated, hypertension] vs 15 [control, hypertension] patients

(ii) conduct a Fisher exact test on the 39 [treated, no-hypertension] vs 11 [control, no-hypertension] patients

(iii) multiply both by 2, to get the Bonferroni-corrected p-values; accept a hypothesis if its Bonferroni-corrected p-value is < 0.05

If we go through all possible splits* of the 1 ICU treated patient into the 11+39 hypertension+non-hypertension patients and the 13 ICU control patients into the 15+11 hypertension+non-hypertension patients (there are 24 total possible splits), the worst possible split for the "Vitamin D works" camp is if 1/11 hypertension & 0/39 non-hypertension treated patients were ICU, and 10/15 hypertension & 3/11 non-hypertension control patients were ICU.

Even in this case you still have a significant treatment effect (the *corrected* p-values for H1 and H2 are 0.01 and 0.02, respectively).

I don't know how kosher this is formally, but it seems like a rather straightforward & conservative way to see whether the effect still stands (and not a desperate attempt to wring significance out of a small study, hopefully), and it does seem to stand.

This should also naturally factor out the direct effect that hypertension may have on ICU admission (which seems to be a big concern).

Other kinds of uncertainty might screw things up though - https://xkcd.com/2440/ -- given the numbers, I really do think there *has* to have been some issue of this kind to explain away these results.
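For anyone who wants to check the numbers, the worst-case split above can be reproduced with a stdlib-only Fisher exact test (one-sided here, then doubled for the Bonferroni correction; the exact test convention used above isn't stated, but these choices reproduce the 0.01 and 0.02):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X <= a) under the hypergeometric null for the 2x2 table
    [[a, b], [c, d]] (rows: treated / control, cols: ICU / no ICU)."""
    treated, icu, total = a + b, a + c, a + b + c + d
    return sum(comb(icu, k) * comb(total - icu, treated - k)
               for k in range(a + 1)) / comb(total, treated)

# Hypertension stratum, worst split: 1/11 treated vs 10/15 control in ICU.
p1 = 2 * fisher_one_sided(1, 10, 10, 5)    # Bonferroni-corrected, ~0.009
# Non-hypertension stratum: 0/39 treated vs 3/11 control in ICU.
p2 = 2 * fisher_one_sided(0, 39, 3, 8)     # ~0.017
```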

My suspicion is that you are reaching the limits of what is possible using statistical inference. There might be, however, alternative mathematical approaches that might provide an answer to the real question "should we prescribe X and, if so, to whom?". I refer specifically to optimisation-based robust classification / regression (see e.g. work by MIT professor Bertsimas https://www.mit.edu/~dbertsim/papers.html#MachineLearning - he's also written a book on this). But I would still worry about the sample size, it feels small to me.

For the ambidexterity question, what was your expected relation between those four questions and the hidden authoritarianism variable? Did you expect them all to move together? All move separately but the more someone got "wrong" the more authoritarian they lean? Were you expecting some of them to be strongly correlated and a few of them to be weakly correlated? All that's to ask: is one strong sub-result and three weak sub-results a "success" or a "failure" of this prediction? Without a structural theory it's hard to know what to make of any correlations.

--------------

Then, just to throw one more wrench in, suppose your background doesn't just change your views on authoritarianism, but also how likely you are to be ambidextrous.

Historically, in much of Asia and Europe teachers enforced a heavy bias against left-handedness in activities like writing. [https://en.wikipedia.org/wiki/Handedness#Negative_connotations_and_discrimination] You don't see that bias exhibited as much by Jews or Arabs [anecdotal], probably because Hebrew and Arabic are written right-to-left. But does an anti-left-handed bias decrease ambidexterity (by shifting ambidextrous people to right-handed) or increase it (by shifting left-handed people to ambidextrous)? Does learning languages with different writing directions increase ambidexterity? Is that checkable in your data?

Most of us learned to write as children [citation needed], and since most of your readership and most of the study's respondents are adults [citation needed] they may have already been exposed to this triumph of nurture over nature. It's possible that the honest responses of either population may not reflect the natural ambidexterity rate. Diving down the rabbit hole, if the populations were subject to these pressures, what does it actually tell us about the self-reported ambidextrous crowd? Are the ambidextrous kids from conservative Christian areas the kids who fell in line with an unreasonable authority figure? Who resisted falling in line? Surely that could be wrapped up in their feelings about authoritarianism.

These examples are presented as being similar, but they have an important distinction.

I agree that testing p-values for the Vitamin D example doesn't make too much sense. However, if you did want to perform this kind of broad-ranging testing, I think you should be concerned with the false discovery rate rather than the overall level of the tests. Each of these tests is, in some sense, a different hypothesis, and should receive its own budget of alpha.

The second example tests as single hypothesis in multiple ways. Because it's a single hypothesis, it could make sense to control the overall size of the test at 0.025. However, because these outcomes are (presumably) highly correlated, you should use a method that adjusts for the correlation structure. Splitting the alpha equally among four tests is unnecessarily conservative.

I have no idea what the actual scientists would say, but the mechanism of "Vitamin D regulates calcium absorption, and calcium is really important for the activation of the immune system, and the immune system fights off viral infections" seems like a decent guess! At least strong enough to make testing worthwhile and, given enormous D-deficiency rates in many of the tested populations, a reasonable candidate for hypothesis testing.

Shouldn't "19:1 in factor" be actually "19:1 in favor"?

Thanks, fixed.

Bayesian statistics require you to compare the chance of the outcome given the null hypothesis vs. the chance of the outcome given h1.

So the chance of getting p=0.5 given the null hypothesis is very high, but very low given h1, so it should significantly update you towards h0.

The correct way to do it from a Bayesian perspective is to compare the probability density of the exact outcome you got given h0 Vs h1, and update according to their ratio.

In practice the intuition is that if you were 100 % sure you would get around this result given h1, then a p = 0.05 would indeed make you 20 times as certain in h1. If you expected to get a bit higher or a bit lower, it would make you less than 20 times as certain (or even reduce your certainty). If you expected to get pretty much exactly that result, it could make you thousands of times more certain.

TLDR: you have to look at both p given h0, and p given h1 to do Bayesian analysis

Right. Except that p=0.05 means 5% chance of this outcome or an outcome more extreme, so if we are talking about densities as you say and we were sure we would get this *exact* outcome in h1 world and we got it, that should make us update way more than 20 times.

Isn't that what I wrote?

> If you expected to get pretty much exactly that result, it could make you thousands of times more certain.

Yeah, I misread. But I still don't get what you meant by "if you were 100 % sure you would get around this result given h1, then a p = 0.05 would indeed make you 20 times as certain in h1.". Are you just describing a common wrong intuition, or claiming that the intuition makes sense if you were sure your result would be "around" in h1? If so, why does that make sense?

Yeah I wasn't very precise there. What I think I meant is that for some distributions of h1 it would be around 20 times higher. I think it's too late to edit my post, but if I could I would delete that.

Maybe, if in h1 it was around 100% sure that one would get a result equal or more extreme than the actual result, then one would get that 20 factor, right?

This clearly wouldn't be using ALL the information one has. It feels weird though to choose what subset of the information to use and which to ignore when updating depending on the result.

I need to study this.

^ This. Your test 3 should not give you a 1:1 Bayes factor, because the libertarian result is pretty likely if the authoritarianism thesis is false (that's where the p-value is coming from) but unlikely if the authoritarianism thesis is true. That's why you can't do any of this with just p-values.

The prior probability of h0 is 0 (no pair of interesting variables in a complex system have zero association) so in a Bayesian analysis no data should update you in favor (or against) it.

You have to look at probability densities not probabilities

I don't get that--probabilities are the thing that has real world interpretation we care about. In a Bayesian analysis it just makes more sense to think in terms of distributions of effect sizes than null hypothesis testing.

True Bayesian analysis is impossible. It requires enumerating the set of all possible universes consistent with the evidence you see, and calculating what percentage have the feature X you're interested in. This is an approximation of Bayesian analysis which updates you towards either h0 (these variables are not strongly linked) or h1 (these variables are strongly linked).

"All possible universes" is just the 4D hypercube representing the possible true differences in group means for the four questions considered. Given a prior on that space it is straightforward to calculate the posterior from the data. The only tricky part is figuring out how to best represent your prior.

Yes Bayesian techniques are exactly meant to deal with issues like this.

The key is to have a large enough study such that if there are 20 potentially relevant factors, even though one of them will probably show a significant difference between groups, that difference will be too small to explain any difference in results. Here the study was tiny, so one group had 25% high blood pressure and the other over 50%.

Yeah, as a reminder, the fluctuations are inversely proportional to the square root of the number of random processes.
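If it helps to see it, here's a toy simulation of that scaling (the 37.5% high-blood-pressure rate is just a made-up number in the ballpark of the study's two groups, and the group sizes are arbitrary):

```python
import random
import statistics

def imbalance(n, p=0.375, trials=2000, seed=0):
    """Randomize two groups of n patients, each with probability p of
    high blood pressure; return the std-dev across trials of the
    between-group difference in high-BP proportion."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        a = sum(rng.random() < p for _ in range(n))
        b = sum(rng.random() < p for _ in range(n))
        diffs.append(a / n - b / n)
    return statistics.pstdev(diffs)

small, large = imbalance(50), imbalance(5000)
# group size grows 100x, typical imbalance shrinks roughly sqrt(100) = 10x
```

With n=50 per arm the typical between-group gap in high-BP rates is around ten percentage points, which is exactly the kind of fluke the post is worried about.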

This is a very good point. However it's not always possible to just "run a bigger study". (In this case, sure, that's probably the right answer). Is it possible to do better with the small sample-size that we have here?

For example, I'm wondering if you can define a constraint on your randomized sampling process that prevents uneven distributions -- "reroll if it looks too biased" in Scott's phrasing, or possibly reweighting the probabilities of drawing each participant based on the population that's been drawn already. Essentially you want the sampled groups to have a distribution in each dimension that matches the sampled population, and that match each other. I'm sure this is something that's common in clinical trial design, and I'll try to dig up some citations for how this is done in the industry.

For example see "quota sampling" and "stratified sampling" here (quick google search so no validation on quality): https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1a-epidemiology/methods-of-sampling-population, which seems relevant, but doesn't go deep enough to analyze when you'd need to use these techniques.

What you're referring to is called blocking. At the most extreme you pair each person to one other person in such a way that the expected imbalance after randomization is minimized.
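A minimal sketch of the pairing idea, with made-up data and pairing on a single covariate for brevity (real blocking would match on several covariates at once):

```python
import random

def matched_pair_randomize(subjects, key, seed=42):
    """Sort subjects by a covariate, pair up neighbours, then flip a
    coin within each pair for treatment vs control."""
    rng = random.Random(seed)
    ordered = sorted(subjects, key=key)
    treat, control = [], []
    for i in range(0, len(ordered) - 1, 2):
        a, b = ordered[i], ordered[i + 1]
        if rng.random() < 0.5:  # randomize within the pair
            a, b = b, a
        treat.append(a)
        control.append(b)
    return treat, control

# made-up blood pressures for 40 hypothetical participants
people = [{"id": i, "bp": 90 + (i * 37) % 91} for i in range(40)]
treat, control = matched_pair_randomize(people, key=lambda s: s["bp"])
```

Because each pair is two near-identical subjects, the arms end up with nearly equal mean blood pressure while the within-pair coin toss preserves randomization.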

Good career advice: never work with children, animals, or small sample sizes.

Just to clarify, since I'm not totally sure I get the argument, is the idea here that as sample size increases, you are less likely to end up with a large difference in the groups (e.g., in this case, you are less likely to end up with wildly different proportions of people with high blood pressure)?

I think it's worth noting that testing for statistically significant differences in covariates is nonsensical. If you are randomizing appropriately, you know that the null hypothesis is true - the groups differ only because of random chance. If you end up with a potentially important difference between the groups, it doesn't really matter if it's statistically significant or not. Rather, it matters to whatever extent the difference influences the outcome of interest.

In this case, if blood pressure is strongly related to covid recovery, even with a large sample size and a small difference between the groups, it would be a good idea to adjust for blood pressure, simply because you know it matters and you know the groups differ with respect to it.

Note that, while it's true that larger sample sizes will be less likely to produce dramatic differences in covariates between randomized treatment and control groups, larger sample sizes will also increase the chance that even small differences in relevant covariates will be statistically significantly related to differences in the outcome of interest, by virtue of increasing the power to detect even small effects.

On the topic of your point in part I, this seems like a place where stratified sampling would help. Essentially, you run tests on all of your patients, and then perform your random sampling in such a way that the distribution of test results for each of the two subgroups is the same. This becomes somewhat more difficult to do the more different types of tests you run, but it shouldn't be an insurmountable barrier.

It's worth mentioning that I've seen stratified sampling being used in some ML applications for exactly this reason - you want to make sure your training and test sets share the same characteristics.

I was going to propose something like this, unaware it already had a name. I imagine it's harder to stratify perfectly when you have 14 conditions and 76 people, but it would be the way to go.

If someone hasn't already, someone should write a stratification web tool that doctors or other non-statisticians can just plug in "Here's a spreadsheet of attributes, make two groups as similar as possible". It should just be something you are expected to do, like doing a power calculation before setting up your experiment.

This is a thing. It's called minimisation or adaptive randomisation. It works, and should be recommended more often for small studies like this.
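For the curious, here's a toy sketch of what minimisation looks like (loosely Pocock-Simon style; the 20% random override is one common way to keep assignments unpredictable, and the covariates and patient mix are made up):

```python
import random

def minimize_assign(patients, covariates, seed=0):
    """Sequentially put each patient on the arm that keeps the marginal
    covariate counts most balanced (toy Pocock-Simon minimisation);
    ties and a 20% random override keep assignments unpredictable."""
    rng = random.Random(seed)
    counts = {arm: {c: {} for c in covariates} for arm in (0, 1)}
    arms = []
    for p in patients:
        def load(arm):
            # how many same-valued patients this arm already has
            return sum(counts[arm][c].get(p[c], 0) for c in covariates)
        if load(0) < load(1):
            arm = 0
        elif load(1) < load(0):
            arm = 1
        else:
            arm = rng.randint(0, 1)
        if rng.random() < 0.2:  # random override
            arm = rng.randint(0, 1)
        for c in covariates:
            counts[arm][c][p[c]] = counts[arm][c].get(p[c], 0) + 1
        arms.append(arm)
    return arms

# 76 hypothetical patients, about a third with high blood pressure
pts = [{"high_bp": i % 3 == 0, "sex": i % 2} for i in range(76)]
arms = minimize_assign(pts, ["high_bp", "sex"])
```

Unlike a single up-front randomization, this works with staggered recruitment: each patient is assigned on arrival, and the high-BP counts stay close to balanced throughout.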

I am a couple of days late, but yes, this is what I would suggest, with caveat I don't have much formal studies in experimental design. I believe in addition to "stratified sampling", another search key word one wants is "blocking".

Intuitively, when considering causality, it makes sense. One hypothesizes that two things, confounding factor A and intervention B, could both have an effect, and one is interested in the effect of the intervention B. To be on the safe side, one wants to separate the effect of A from the effect of B by design, rather than leave it to chance.

The intuitive reason here is that randomization is not magic, but serves a purpose: reducing statistical bias from unknown confounding factors.

Abstracted lesson I have learned from several practical mistakes over the years: Suppose I have 1000 data items, say, math exam scores of students from all schools in a city. As far as I am concerned, I do not know if there is any structure in the data, so I call it random and split the samples into two groups of 500, the first 500 in one and the rest in the other. What if later I am told that the person who collected the data reported them in the spatial order of the school districts as they parsed the records, forgot to include the district information, but now tells me that the first 500 students include pupils from the affluent district with the school that has a nationally renowned elite math program? An unknown factor has now become a known factor, while the data and the sample order remained the same! Doing the randomization myself is how I could have avoided this kind of bias. And if the person tells me the districts of the students beforehand, I can actually take this into account while designing my statistical model.

Were those really the only questions you considered for the ambidexterity thing? I had assumed, based on the publication date, that it was cherry-picked to hell and back.

No, it was completely legit.

Guess I should reread it taking it seriously this time. (Not the first time I've had to do that with something published Apr 1)

Please forgive me for remaining a teeny bit suspicious.

There's good reason to be suspicious of the result (as I'm sure Scott would agree - one blogger doing one test with one attempted multiple-hypothesis test correction just isn't that strong evidence, especially when it points the opposite direction of the paper he was trying to replicate).

I think there's no reason to be suspicious that Scott was actually doing an April Fools - it just wasn't a good one, and Scott tends to do good twist endings when he has a twist in mind.

Agreed, though that post would have benefited a lot from being posted on another day (or have some kind of not-an-april-fools-joke disclaimer).

I agree with you. I was asking for forgiveness because despite there being no reason for suspicion, I found that I was, indeed, still suspicious.

One comment here: if you have a hundred different (uncorrelated) versions of the hypothesis, it would be *hella weird* if they all came back around p=.05. Just by random chance, if you'd expect any individual one of them to come back at p=0.05, then if you run 100 of them you'd expect one of them to be unusually lucky and come back at p=.0005 (and another one to be unusually unlucky and end up at p=0.5 or something). Actually getting 100 independent results all at p=.05 is too unlikely to be plausible.

Of course, IRL you don't expect these to be independent - you expect both the underlying thing you're trying to predict and the error sources to be correlated across them. This is where it gets messy - you sort of have to guess how correlated these are across your different metrics (e.g. high blood pressure would probably be highly correlated with cholesterol or something, but only weakly correlated with owning golden retrievers). And that in itself is kind of a judgement call, which introduces another error source.

You can empirically determine the correlation between those different metrics (e.g., R^2) and correct the multiple hypothesis test statistic.

What I generally use and see used for multiple testing correction is the Benjamini-Hochberg method - my understanding is that essentially it looks at all the p-values and for each threshold X it compares how many p-values < X you got vs how many you'd expect by chance, and adjusts them up depending on how "by chance" they look based on that. In particular, in your "99 0.04 results and one 0.06" example it only adjusts the 0.04s to 0.040404.

But generally speaking you're right, all those methods are designed for independent tests, and if you're testing the same thing on the same data, you have no good way of estimating how dependent one result is on another so you're sort of screwed unless you're all right with over-correcting massively. (Really what you're supposed to do is pick the "best" method by applying however many methods you want to one dataset and picking your favorite result, and then replicate that using just that method on a new dataset. Or split your data in half to start with. But with N<100 it's hard to do that kind of thing.)
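For reference, BH is simple enough to write out in a few lines; this sketch reproduces the 0.04/0.06 example above:

```python
def benjamini_hochberg(pvals):
    """BH-adjusted p-values: scale the k-th smallest p-value by m/k,
    then enforce monotonicity from the largest down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# 99 results at p=0.04 and one at p=0.06:
adj = benjamini_hochberg([0.04] * 99 + [0.06])
# the 0.04s become 0.04 * 100/99 = 0.040404..., the 0.06 stays 0.06
```

Note the gentleness: when almost every p-value clears the threshold, the adjustment is tiny, unlike Bonferroni's divide-by-100.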

For a bayesian analysis, it's hard to treat "has an effect" as a hypothesis and you probably don't want to. You need to treat each possible effect size as a hypothesis, and have a probability density function rather than a probability for both prior and posterior.

You *could* go the other way if you want to mess around with Dirac delta functions, but you don't want that. Besides, the probability that any organic chemical will have literally zero effect on any biological process is essentially zero to begin with.

I want a like button for this comment.

And the effect size is what you actually want to know, if you are trying to balance the benefits of a drug against side effects, or of a social programme against costs.

Has there already been any formal work done on generating probability density functions for statistical inferences, instead of just picking an overly specific estimate? This seems like you're on to something big.

Meteorology and economics, for example, return results like "The wind Friday at 4pm will be 9 knots out of the northeast" and "the CPI for October will be 1.8% annualized", which inevitably end up being wrong almost all of the time. If I'm planning a party outside Friday at 4pm I'd much rather see a graph showing the expected wind magnitude range than a single point estimate.

The concept you want is Conjugate Priors.

If the observations all follow a certain form, and the prior follows the conjugate form, the posterior will also follow the conjugate form, just with different parameters. Most conjugate forms can include an extremely uninformative version, such as a Gaussian with near-infinite variance.

The Gaussian's conjugate prior (for its mean, with known variance) is itself, which is very convenient.

The conjugate prior for Bernoulli data (a series of yes or nos) is called a Beta distribution (x^a(1-x)^b) and is probably what we want here. Its higher-dimensional generalization is called a Dirichlet distribution.
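The Beta-Bernoulli update really is a one-liner, which is the whole appeal of conjugacy (the 7-yes/3-no data here is just an illustrative example):

```python
def beta_update(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior + Bernoulli observations
    -> Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# flat (uninformative) prior Beta(1, 1), then observe 7 yeses, 3 nos:
a, b = beta_update(1, 1, 7, 3)
# posterior is Beta(8, 4), with mean 8/12 = 2/3
```

So instead of a point estimate you carry the whole density around, and updating on new data is just adding counts.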

This might interest you:

https://arxiv.org/abs/2010.09209

Very cool, I'll dig in, thanks both

This is true, Scott's "Bayes factors" are completely the wrong ones. It's also true that the "real" Bayes factors are hard to calculate and depend on your prior of what the effect is, but just guessing a single possible effect size and calculating them will give you a pretty good idea — usually much better than dumb old hypothesis testing.

For example, let's assume that the effect size is 2 SD. This way any result 1 SD and up will give a Bayes factor in favor of the effect, anything 1 SD or below (or negative) will be in favor of H0. Translating the p values to SD results we get:

1. p = 0.049 ~ 2 SD

2. p = 0.008 ~ 2.65

3. p = 0.48 ~ 0

4. p = 0.052 ~ 2

For each one, we compare the height of the density function at the result's distance from 0 and from 2 (our guessed at effect), for example using NORMDIST(result - hypothesis,0,1,0) in Excel.

1. p = 0.049 ~ 2 SD = .4 density for H1 and .05 density for H0 = 8:1 Bayes factor in favor.

2. p = 0.008 ~ 2.65 = .32 (density at .65) and .012 (density at 2.65) = 27:1 Bayes factor.

3. p = 0.48 ~ 0 = 1:8 Bayes factor against

4. p = 0.052 ~ 2 = 8:1 Bayes factor in favor

So overall you get 8*8*27/8 = 216 Bayes factor in favor of ambidexterity being associated with authoritarian answers. This is skewed a bit high by a choice of effect size that hit exactly on the results of two of the questions, but it wouldn't be much different if we chose 1.5 or 2.5 or whatever. If we used some prior distribution of possible positive effect sizes from 0.1 to infinity, maybe the factor would be 50:1 instead of 216:1, but still pretty convincing.
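The same arithmetic in Python, using an exact normal density in place of NORMDIST (the 2 SD effect size is, as above, just a guessed H1, and the z-scores are the rounded ones from the comment):

```python
import math

def normal_pdf(x, mu=0.0):
    """Standard normal density at x - mu, i.e. NORMDIST(x - mu, 0, 1, 0)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def bayes_factor(z, effect=2.0):
    """Likelihood ratio of an observed z-score under H1 (true effect
    = `effect` SD) vs H0 (true effect = 0)."""
    return normal_pdf(z, mu=effect) / normal_pdf(z, mu=0.0)

zs = [2.0, 2.65, 0.0, 2.0]  # the four questions' approximate z-scores
factors = [bayes_factor(z) for z in zs]
combined = math.prod(factors)
# factors come out near [7.4, 27, 0.14, 7.4]; the product is about 200,
# in the same ballpark as the rounded 216:1 above
```

Using the exact densities instead of rounding to 8:1 and 1:8 gives a combined factor near 200 rather than 216, which makes no practical difference to the conclusion.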

I don't understand what you are doing here. Aren't you still double-counting evidence by treating separate experiments as independent? After you condition on the first test result, the odds ratio for the other ones has to go down because you are now presumably more confident that the 2SD hypothesis is true.

You're making a mistake here. p = 0.049 is nearly on the nose if the true effect size is 2 SD *and you have one sample*. But imagine you have a hundred samples; then p = 0.049 is actually strong evidence in favor of the null hypothesis. In general the standard deviation of an average falls like one over the square root of the sample size. So p = 0.049 is like 2/sqrt(100) = .2 standard deviations away from the mean, not 2.

Not directly answering the question at hand, but there's a good literature that examines how to design/build experiments. One paper by Banerjee et al. (yes, the recent Nobel winner) finds that you can rerandomize multiple times for balance considerations without much harm to the performance of your results: https://www.aeaweb.org/articles?id=10.1257/aer.20171634

So this experiment likely _could_ have been designed in a way that doesn't run into these balance problems

There is also a new paper by the Spielman group at Yale that examines how to balance covariates: https://arxiv.org/pdf/1911.03071.pdf

So this is fine in studies on healthy people where you can give all of them the drug on the same day. If you’re doing a trial on a disease that’s hard to diagnose then the recruitment is staggered and rerandomizing is impossible.

In the paper I linked above, Appendix A directly speaks to your concern on staggered recruitment. Long story short, sequential rerandomization is still close to first-best and should still be credible if you do it in a sensible manner.

On the ambidextrous analysis, I think what you want for a Bayesian approach is to say, before you look at the results of each of the 4 variables, how much you think a success or failure in one of them updates your prior on the others. e.g. I could imagine a world where you thought you were testing almost exactly the same thing and the outlier was a big problem (and the 3 that showed a good result really only counted as 1.epsilon good results), and I could also imagine a world in which you thought they were actually pretty different and there was therefore more meaning to getting three good results and less bad meaning to one non-result. Of course the way I've defined it, there's a clear incentive to say they are very different (it can only improve the likelihood of getting a significant outcome across all four), so you'd have to do something about that. But I think the key point is you should ideally have said something about how much meaning to read across the variables before you looked at all of them.

Oops I missed that Yair said a similar thing but in a better and more mathy way

"I don't think there's a formal statistical answer for this."

Matched pairs? Ordinarily only gender and age, but in principle you can do matched pairs on arbitrarily many characteristics. You will at some point have a hard time making matches, but if you can match age and gender in, say, 20 significant buckets, and then you have, say, five binary health characteristics you think might be significant, you would have about 640 groups. You'd probably need a study of thousands to feel sure you could at least approximately pair everyone off.

Hmm, wondering if you can do a retrospective randomized matched pairs subset based on random selections from the data (or have a computerized process to do a large number of such sub-sample matchings). Retrospectively construct matched pair groups on age, gender and blood pressure. Randomize the retrospective groupings to the extent possible. Re-run the analysis. Redo many times to explore the possible random subsets.

Isn't the point to put each member of a matched pair into the control or the intervention group? If so you'd need to do it first (i.e. in Scott's example before administering the vitamin D).

Not necessarily so. You could I think take an experiment that hadn't been constructed as a matched pair trial and then choose pairs that match on your important variables, one from treatment, one from control, as a sub-sample. If for any given pair you choose there are multiple candidates in the treatment and control group to select, you can "randomize" by deciding which two candidates constitute a pair. Now you've got a subsample of your original sample consisting of random matched pairs, each of which has one treatment and one control.

Post hoc you have a set of records, each of which has values for control (yes/no), age, blood pressure, etc., and outcome. You can then calculate a matching score on whichever parameters you like for each pair of intervention-control subjects. Then for each intervention subject you can rank the control subjects by descending match score and select the top 1 (or more if desired). This is matching with replacement, but could also be done without replacement.

There are many ways to achieve this after the fact is what I am saying. None are perfect but most are reasonable.

I'm not convinced there is actually an issue. Whenever we get a positive result in any scientific experiment there is always *some* chance that the result we get will be random chance rather than because of a real effect. All of this debate seems to be about analyzing a piece of that randomness and declaring it to be a unique problem.

If we do our randomization properly, on average, some number of experiments will produce false results, but we knew this already. It is not a new problem. That is why we need to be careful to never put too much weight in a single study. The possibility of these sorts of discrepancies is a piece of that issue, not a new issue. The epistemological safeguards we already have in place handle it without any extra procedure to specifically try to counter it.

Matched pair studies seem fine, and I see the benefit of them, I am just not convinced they are always necessary, and they can't solve this problem.

Ultimately there are an arbitrary number of possible confounding variables, and no matter how much matching you do, there will always be some that "invalidate" your study. You don't even know which are truly relevant. If you did humanity would be done doing science.

If you were able to do matched pair study that matched everything, not just age and gender, you would have to be comparing truly identical people. At that point, your study would be incredibly powerful, it would be something fundamentally better than an RCT, but obviously this is impossible.

In any given study one starts with the supposition that some things have a chance of being relevant. For example, you may think some treatment, such as administering vitamin D, has a chance of preventing an illness or some of its symptoms. There are other things that are outside what you can control or affect that you also think may well be relevant, though you hope not too much. And finally there are an unlimited number of things that you think are very unlikely or that you have no reason to believe would be relevant but that you cannot rule out.

You seem to be suggesting everything in the second category should be treated as though it were in the third category, or else everything in the third category ought to be treated as though it belonged in the second category. But the world is not like that. It is not the case that the existence of things in the third category means that there are always more things that should have been in the second category. The two categories are not the same.

Although the third category is larger than the second category, the second category is also practically unlimited. Also, the exact location of the border between the two is subjective.

It seems like it would be impossible to explicitly correct for every possible element of the second category, but if you don't, it isn't clear that you are accomplishing very much.

It is not practically unlimited. At any given time there will typically be a small finite number of confounders of serious importance. Blood pressure is an obvious one. You are making a slippery slope argument against dealing with confounders, but there's no slope, much less a slippery one.

It seems to me to be a huge number. I am considering:

- Preexisting medical conditions

- Age, gender, race/ethnicity

- Other drugs/medical care (both for COVID and for preexisting conditions)

- Environmental factors

- Every possible permutation of previously elements on the list

Which do you think aren't actually relevant?

I am not making a slippery slope argument. I'm not sure if you are misinterpreting my comment, or misusing the name of the argument, but either way, you are incorrect. If you clarify what you meant, I will explain in more detail.

The idea behind matched pair studies is that the pairing evens out all the known potentially confounding variables, while the randomization within the pair (a coin toss between the two as to who goes into the experimental group and who into the control) should take care of the unknown ones.

It's not a new problem, but stratification exists to solve it. There are solutions to this issue.

Stratification is also a valuable tool, but I am not convinced it is necessary either, and using it inherently introduces p-hacking concerns and weakens the available evidence.

Both stratification and matching exist to deal with problems like this. Maybe a lot of times it's not necessary, but this post is entirely about a case where it might be: because of the high blood pressure confounder. I'm not sure what to make of the idea that the solution is worse than the problem: why do you think that?

I do not believe that the solution is worse than the problem, at least not in such an absolute sense.

What I actually believe is that the solution is not generally necessary. I also believe that in most situations where the experiment has already been done, the fact that one of these solutions had not been applied shouldn't have a significant impact on our credence about the study.

This is an awful lot of weight to put on your beliefs or opinions about the matter.

I don't understand what you are trying to say here. How much weight do you think I am putting on my beliefs and opinions?

There's a difference between saying "we know some of them will do this" and "we have pretty good evidence it was this one in particular"

Clarification question:

When you say "some of them", do you mean some of the studies, or some of the variables? I think you meant some of the studies, so I am going to respond to that. Please correct me if I am wrong.

I think that this is happening in (almost?) all of the studies. It is just a question of whether we happen to notice the particular set of variables that it is happening for. I think that the section of TFA about golden retrievers is a little bit misleading. Even considering only variables that could conceivably be relevant, there are still a nearly infinite number of possible variables. The question of whether it gets noticed for a particular study is arbitrary, and more related to which variables happened to get checked than to the strength of the study itself.

I would agree; when we say "there is a 5% chance of getting this result by random chance", then this is exactly the sort of scenario which is included in that five percent.

But what is currently doing my head in is this: once we know that there _is_ a significant difference between test and control groups in a potentially-significant variable, are we obliged to adjust our estimate that the observed effect might be due to random chance?

And if we are, then are we obliged to go out fishing for every other possible difference between our test and control groups?

I agree with you. I think there are decent arguments on both sides.

Arguments in favor of doing these corrections:

- Once we identify studies where there is a difference in a significant variable, that means that this particular study is more likely to be the result of chance

- Correcting can only improve our accuracy because we are removing questions where the randomization is causing interference.

Arguments against doing the corrections:

- There is always a significant variable that is randomized poorly (because of how many there are), when we notice it, that tells us more about what we can notice than it does about the study itself.

- A bunch of these sorts of errors counter each other out. Removing some of them is liable to have unforeseen consequences.

- Ignoring the issue altogether doesn't do any worse on average. Trying to correct in only specific instances could cause problems if you aren't precise about it.

It seems complicated enough that I don't fully know which way is correct, but I am leaning against making the corrections.

If your original confidence level included the possibility that the study might have had some bias that you haven't checked for, then in principle, when you subsequently go fishing for biases, every bias you find should give you less faith in the study, BUT every bias that you DON'T find should give you MORE faith. You are either eliminating the possible worlds where that bias happened or eliminating the possible worlds where it didn't happen.

Of course, counting up all the biases that your scrutiny could've found (but didn't) is hard.

I don't think it quite works like that. The possible biases are equally likely to help you or hurt you. Before you actually examine them, their expected impact is 0 (assuming you are running a proper RCT). Every bias you find that helped the result resolve as it did would give you less credence in that result. Every bias that was in the opposite direction would give you more. The biases you don't know about should average out.

Wouldn't you a posteriori expect to find a positive impact from bias given that the result was positive?

I guess that depends on your priors on the effect size?

You’re right. This is a good point. It is only 0 if you are talking about all studies attempted.

You basically never want to base your analysis on combined p-values directly. You want to--as you said--combine the underlying data sets and compute a p-value on that. Or, alternately, treat each as a meta-data-point with some stdev of uncertainty and then find the (lower) stdev of their combined evidence. (Assuming they're all fully non-independent in what they're trying to test.)

Good thoughts on multiple comparisons. I made the same point (and a few more points) in a 2019 Twitter thread: https://twitter.com/stuartbuck1/status/1176635971514839041

A good way to handle the problem of failed randomizations is with re-weighting based on propensity scores (or other similar methods, but PS is the most common). In brief, you use your confounders to predict the probability of having received the treatment, and re-weight the sample depending on the predicted probabilities of treatment. The end result of a properly re-balanced sample is that, whatever the confounding effect of blood pressure on COVID-19 outcomes, it confounds both treated and untreated groups with equal strength (in the same direction). Usually you see this method talked about in terms of large observational data sets, but it's equally applicable to anything (with the appropriate statistical and inferential caveats). Perfectly balanced data sets, like from a randomized complete block design, have constant propensity scores by construction, which is just another way of saying they're perfectly balanced across all measured confounders.
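A toy version of that reweighting, with the propensity score estimated as the treated fraction within each level of a single binary confounder (a stand-in for the usual logistic-regression model; the patient counts are made up):

```python
from collections import defaultdict

def iptw_weights(records, confounder, treated):
    """Inverse-probability-of-treatment weights, with the propensity
    score taken as the treated fraction within each level of one
    discrete confounder."""
    n_total, n_treated = defaultdict(int), defaultdict(int)
    for r in records:
        n_total[r[confounder]] += 1
        n_treated[r[confounder]] += r[treated]
    weights = []
    for r in records:
        ps = n_treated[r[confounder]] / n_total[r[confounder]]
        weights.append(1 / ps if r[treated] else 1 / (1 - ps))
    return weights

# made-up trial where high blood pressure landed unevenly in the arms:
data = (
    [{"high_bp": 1, "treated": 1}] * 6 + [{"high_bp": 1, "treated": 0}] * 14
    + [{"high_bp": 0, "treated": 1}] * 30 + [{"high_bp": 0, "treated": 0}] * 26
)
w = iptw_weights(data, "high_bp", "treated")
# after reweighting, both arms have the same weighted high-BP prevalence
```

The treated high-BP patients, being underrepresented, get large weights; after reweighting, both arms have identical weighted high-BP prevalence, so that confounder can no longer explain an outcome difference.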

For the p-value problem, whoever comes in and talks about doing it Bayesian is, I believe, correct. I like to think of significance testing as a sensitivity/specificity/positive predictive value problem. A patient comes in from a population with a certain prevalence of a disease (aka a prior), you apply a test with certain error statistics (sens/spec), and use Bayes' rule to compute the positive predictive value (assuming it comes back positive, NPV otherwise). If you were to do another test, you would use the old PPV in place of the original prevalence, and do Bayes again. Without updating your prior, doing a bunch of p-value based inferences is the same as applying a diagnostic test a bunch of different times without updating your believed probability that the person has the disease. This is clearly nonsense, and for me at least it helps to illustrate the error in the multiple hypothesis test setting.
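That sequential-updating picture is easy to play with in code (a sketch; the prevalence, sensitivity, and specificity numbers are arbitrary):

```python
def posterior_after_positive(prior, sensitivity, specificity):
    """Bayes' rule for one positive test result: returns the PPV,
    which then becomes the prior for the next test."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

p = 0.01  # disease prevalence, i.e. the initial prior
first = posterior_after_positive(p, 0.9, 0.95)
# Correct: feed the updated belief into the second test.
second = posterior_after_positive(first, 0.9, 0.95)
# The error described above would be to reuse p = 0.01 for the second
# positive test, which just reproduces `first` instead of `second`.
```

With these numbers a single positive test only gets you to roughly a 15% chance of disease; a second positive, correctly chained, gets you into the 70s.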

Finally, seeing my name in my favorite blog has made my day. Thank you, Dr. Alexander.

> Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it?

The point of "randomizing" is to drown out factors that we don't know about. But given that we know that blood pressure is important, it's insane to throw away that information and not use it to divide the participant set.

I think the proper way to do this might be stratified sampling [1]. Divide the population into all relevant subgroups that you know about and then sample from each subgroup at the same rate to fill your two groups.

[1]: https://en.wikipedia.org/wiki/Stratified_sampling

A very simple test I'd like to see would be to re-run the analysis with all high-blood-pressure patients removed. (Maybe that's what they did when 'controlling for blood pressure' - or maybe they used some statistical methodology. The simple test would be hard to get wrong.)

If you did that you'd have only eleven patients left in the control group, which I'm gonna wildly guess would leave you with something statistically insignificant.

Well, that's the trouble with small tests. If the blood pressure confound is basically inseparable, and blood pressure is a likely Covid signifier, the result is just not all that strong for Vitamin D. One's priors won't get much of a kick.

[Should it be posteriors that get a kick? Seems like a pun that we'd have heard before...]

Adaptive randomisation does this but preserves the unpredictability of assignment from the perspective of the investigator: https://www.hilarispublisher.com/open-access/a-general-overview-of-adaptive-randomization-design-for-clinical-trials-2155-6180-1000294.pdf

>I think the proper way to do this might be stratified sampling [1]. Divide the population into all relevant subgroups that you know about and then sample from each subgroup at the same rate to fill your two groups.

I don't think that works for an n=76 study with fifteen identified confounders, or at least the simple version doesn't. You can't just select a bunch of patients from the "high blood pressure" group and then from the "low blood pressure" group, and then from the "over 60" group and the "under 60" group, and then the "male" group and the "female" group, etc, etc, because each test subject is a member of three of those groups simultaneously. By the time you get to e.g. the "men over 60 with low blood pressure" group, with n=76 each of your subgroups has an average of 9.5 members. With fifteen confounders to look at, you've got 32,768 subgroups, each of which has an average of 0.002 members.

If you can't afford to recruit at least 65,000 test subjects, you're going to need something more sophisticated than that.

This topic is actually well discussed among randomista econometricians. I believe they used to advise "rerolling" until you get all confounders to be balanced, but later thought it might create correlations or selection on *unobservable* confounders, so weakly advised against it.

I agree that stratification of some sort is what I would try.

For something more sophisticated, see these slides[1] by Chernozhukov which suggest something called post-double-selection.

The post-double-selection method is to select all covariates that predict either treatment assignment or the outcome by some measure of prediction (t-test, Lasso, ...), then include those covariates in the final regression and use the standard confidence intervals.

[1] https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf

Regarding Bayes. If a test (that is independent of the other tests) gives a Bayes factor of 1:1, then that means that the test tells you nothing. Like, if you tested the Vitamin D thing by tossing a coin. It's no surprise that it doesn't change anything.

If a test for vit D and covid has lots of participants and the treatment group doesn't do better, that's not a 1:1 bayes factor for any hypotheses where vit d helps significantly. The result is way less likely to happen in the world where vit d helps a lot.

People, correct me if I'm mistaken or missing the point. This is the image in my mind:

https://ibb.co/vcqW9HQ

The horizontal axis is the result of a test. In black, the probability density in the null world (vit D doesn't help for covid). In blue, the probability density in some specific world (vit D helps with covid in *this* specific way).

The test result comes out as marked in red. The area to the right of that point is the p. It only depends on the black curve, so we don't need to get too specific about our hypothesis other than to know that the higher values are more likely in the hypothetical worlds.

The bayes factor would be the relative heights of the curves at the point of the result. Those depend on the specific hypotheses, and can clearly take values greater or smaller than 1 for "positive test results".

1. Did someone try to aggregate the survival expectation for both groups (patient by patient, then summed up) and control for this?

Because this is the one and main parameter.

2. Is the "previous blood pressure" strong enough a detail to explain the whole result?

3. My intuition is that this multiple comparison thing is way too dangerous an issue to go ex post and use one of the tests to explain the result.

This sounds counterintuitive. But this is exactly the garden of forking paths issue. Once you go after the fact to select numbers, your numbers are really meaningless.

Unless of course you happen to land on the 100% smoker example.

But!

You will need a really obvious situation, rather than a maybe parameter.

Rerolling the randomization, as suggested in the post, doesn't usually work because people are recruited one-by-one on a rolling basis.

But for confounders that are known a priori, one can use stratified randomization schemes, e.g. block randomization within each stratum (preferably categories, and preferably only few). There are also more fancy "dynamic" randomization schemes that minimize heterogeneity during the randomization process, but these are generally discouraged (e.g., EMA guideline on baseline covariates, Section 4.2).

In my naive understanding, spurious effects due to group imbalance are part of the game, that is, included in the alpha = 5% of false positive findings that one will obtain in the null hypothesis testing model (for practical purposes, it's actually only 2.5% because of two-sided testing).

But one can always run sensitivity analyses with a different set of covariates, and the authors seem to have done this anyway.

I think I've read papers where the population was grouped into similar pairs and each pair was randomized. It seems to me that the important question is not so much rolling recruitment, but speed, in particular the time from recruitment and preliminary measurement to randomization. Acute treatments leave no time to pair people, but some trials have weeks from induction to treatment.

FYI, you can next-to-guarantee that the treatment and control groups will be balanced across all relevant factors by using blocking or by using pair-matching. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4318754/#:~:text=In%20randomized%20trials%2C%20pair%2Dmatching,best%20n%2F2%20matched%20pairs.

See also https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3483834

This is a nice trick with limited applicability in most clinical trial settings, because it requires that you know all of your subjects' relevant baseline characteristics simultaneously prior to randomizing in order to match them up. They could do that in their HIV therapy trial example because they would get "batches" of trial eligible subjects with preexisting HIV infection. In the COVID-Vit D study, and most others, subjects come one at a time, and expect treatment of some kind right away.

Nope, you could take the first person who arrives, say female between age 60 and 70, high blood pressure but no diabetes, and flip a coin to see which group she goes into. Treat her, measure what happens. Continue doing this for each new participant until you get a ‘repeat’; another female between 60 and 70, high blood pressure but no diabetes. She goes into whichever group that first woman didn’t go in.

Keep doing this until you’ve got a decent sized sample made up of these pairs. Discard the data from anyone who didn’t get paired up.
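A sketch of that rolling pair-matching scheme (the stratum keys and arm labels are made up for illustration):

```python
import random

def pair_match_stream(subjects, seed=0):
    """Assign subjects arriving one at a time.

    Each subject is a hashable stratum key (e.g. "F60-70/HTN").
    The first arrival in a stratum is randomized by coin flip; the next
    arrival in the same stratum gets the opposite arm, completing a pair.
    Subjects still waiting for a partner at the end are discarded."""
    rng = random.Random(seed)
    waiting = {}   # stratum -> (subject_index, arm)
    pairs = []     # completed pairs: ((index, arm), (index, opposite_arm))
    for i, stratum in enumerate(subjects):
        if stratum in waiting:
            j, arm = waiting.pop(stratum)
            other = "control" if arm == "treat" else "treat"
            pairs.append(((j, arm), (i, other)))
        else:
            waiting[stratum] = (i, rng.choice(["treat", "control"]))
    return pairs, list(waiting.values())

subjects = ["F60-70/HTN", "M60-70", "F60-70/HTN", "M60-70", "F<60"]
pairs, discarded = pair_match_stream(subjects)
```

Here the two repeated strata form pairs and the lone "F&lt;60" subject is discarded, which is exactly the sample-size cost discussed in the reply below.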

What you've described is something different than what the paper talks about though. Your solution is basically a dynamic allocation with equal weighting on HTN and diabetes as factors, and the complete randomization probability or second best probability set to 0% (Medidata Rave and probably other databases can do this pretty easily). And while it would definitely eliminate the chances of a between group imbalance on hypertension or diabetes, I still don't see it being a popular solution for two reasons. First, because the investigators know which group a subject is going to be in if they are second in the pair; second, because it's not clear ahead of time that you don't just want the larger sample size that you'd get if you weren't throwing out subjects that couldn't get matched up. It's sort of a catch-22: small trials like the Vitamin D study need all the subjects they can get, and can't afford to toss subjects for the sake of balance that probably evens out through randomization anyway; large trials can afford to do this, but don't need to, because things *will* even out by the CLT after a few thousand enrollments.

I think controlling for noise issues with regression is a fine solution for part 1. You can also use methods that generate random groups subject to a constraint like "each group should have similar average Vitamin D." Pair up experimental units with similar observables, and randomly assign one of each pair to each group (like https://en.wikipedia.org/wiki/Propensity_score_matching but with an experimental intervention afterwards).

For question 2, isn't this what https://en.wikipedia.org/wiki/Meta-analysis is for? Given 4 confidence, intervals of varying widths and locations, you either: 1. determine the measurements are likely to be capturing different effects, and can't really be combined; or 2. generate a narrower confidence interval that summarizes all the data. I think something like random-effects meta analysis answers the question you are asking.

Secondarily, 0 effect vs. some effect is not a good Bayesian hypothesis. You should treat the effect as having some distribution, which is changed by each piece of information. The location and shape can be changed by any test result; an experiment with an effect near 0 moves the probability mass towards 0, while an extreme result moves it away from 0.
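The simplest version of the combination step is fixed-effect (inverse-variance) pooling; a random-effects analysis would add a between-study variance term on top of this sketch:

```python
import math

def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted pooling of several estimates of the
    same effect. Returns (pooled_estimate, pooled_standard_error)."""
    weights = [1 / se**2 for se in std_errors]
    pooled = sum(w * m for w, m in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

# Two equally precise studies of the same effect: the pooled estimate is
# their average and the standard error shrinks by a factor of sqrt(2),
# i.e. the combined confidence interval is narrower than either input.
est, se = fixed_effect_pool([1.0, 1.0], [1.0, 1.0])
```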

You shouldn't just mindlessly adjust for multiple comparisons by dividing the significance threshold by the number of tests. This Bonferroni adjustment is used to control the familywise error rate (FWER), which is the probability of rejecting at least one hypothesis given that they are all true null hypotheses. Are you sure that is what you want to control for in your ambidexterity analysis? It's not obvious that it is.

My former employer Medidata offers software-as-a-service (https://www.medidata.com/en/clinical-trial-products/clinical-data-management/rtsm) that lets you ensure that any variable you thought of in advance gets evenly distributed during randomization. The industry term is https://en.wikipedia.org/wiki/Stratification_(clinical_trials)

By the way: Thanks for mentioning the "digit ratio" among other scientifically equally relevant predictors such as amount of ice hockey played, number of nose hairs, eye color, percent who own Golden Retrievers.

Made my day <3

The easy explanation here is that the number of people randomized was so small that there was no hope of getting a meaningful difference. Remember, the likelihood of adverse outcome of COVID is well below 10% - so we're talking about 2-3 people in one group vs 4-5 in the other. In designing a trial of this sort, it's necessary to power it based on the number of expected events rather than the total number of participants.

Hmm yes, a few of the responses have suggested things like stratified randomisation and matched pairs but my immediate intuition is that n = ~75 is too small to do that with so many confounders anyway.

I think you would want to construct a latent construct out of your questions that measures 'authoritarianism', and then conduct a single test on that latent measure. Perhaps using a factor analysis or similar to try to divine the linear latent factors that exist, and perusing them manually (without looking at correlation to your response variable, just internal correlation) to see which one seems most authoritarianish. And then finally measuring the relationship of that latent construct to your response variable, in this case, ambidexterity.

I am now extremely confused (as distinct from my normal state of mildly confused), because I looked up the effects of Vitamin D on blood pressure.

According to a few articles and studies from cursory Googling, vitamin D supplementation will:

(1) Maybe reduce your blood pressure. Or maybe not. It's complicated https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5356990/

(2) Won't have any effect on your blood pressure but may line your blood vessels with calcium like a furred kettle, so you shouldn't take it. Okay, it's kinda helpful for women against osteoporosis, but nothing doing for men https://health.clevelandclinic.org/high-blood-pressure-dont-take-vitamin-d-for-it-video/

(3) Have no effect https://www.ahajournals.org/doi/10.1161/01.hyp.0000182662.82666.37

This Chinese study has me flummoxed - are they saying "vitamin D has no effect on blood pressure UNLESS you are deficient, over 50, obese and have high blood pressure"?

https://journals.lww.com/md-journal/fulltext/2019/05100/the_effect_of_vitamin_d3_on_blood_pressure_in.11.aspx

"Oral vitamin D3 has no significant effect on blood pressure in people with vitamin D deficiency. It reduces systolic blood pressure in people with vitamin D deficiency that was older than 50 years old or obese. It reduces systolic blood pressure and diastolic pressure in people with both vitamin D deficiency and hypertension."

Maybe the Cordoba study was actually backing up the Chinese study, in that if you're older, fatter, have high blood pressure and are vitamin D deficient then taking vitamin D will help reduce your blood pressure. And reducing your blood pressure helps your chances with Covid-19.

So it's not "vitamin D against covid", it's "vitamin D against high blood pressure in certain segments of the population against covid" which I think is enough to confuse the nation.

You want to use 4 different questions from your survey to test a single hypothesis. I *think* the classical frequentist approach here would be to use Fisher's method, which tells you how to munge your p values into a single combined p: https://en.wikipedia.org/wiki/Fisher%27s_method

Fisher's method makes the fairly strong assumption that your 4 tests are independent. If this assumption is violated you may end up rejecting the null too often. A simpler approach that avoids this assumption would be to Z-score each of your 4 survey questions and then sum the 4 Z-scores for each survey respondent. You can then just do a regular t-test comparing the mean sum-of-Z-scores between the two groups. This should have the desired effect (i.e. an increase in the power of your test by combining the info in all 4 questions) without invoking any hairy statistics.
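For what it's worth, Fisher's method itself is only a few lines (a sketch; the chi-square tail probability below uses the closed form that holds for even degrees of freedom, which 2k always is):

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square distributed
    with 2k degrees of freedom under the null, assuming the k tests
    are independent. Returns the combined p-value."""
    x = -2 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)
    # Chi-square survival function, closed form for even df = 2k:
    # P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = x / 2
    return math.exp(-half) * sum(half**i / math.factorial(i)
                                 for i in range(k))

combined = fisher_combine([0.04, 0.04, 0.04, 0.04])
```

A quick sanity check is that a single p-value passes through unchanged; and four independent p = 0.04 results combine to something far below 0.05.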

Btw. these folks are willing to bet $100k that Vitamin D significantly reduces ICU admissions. https://blog.rootclaim.com/treating-covid-19-with-vitamin-d-100000-challenge/

A few people seem to have picked up on some of the key issues here, but I'll reiterate.

1. The study should have randomized with constraints to match blood pressures between the groups. This is well established methodology.

2. Much of the key tension between the different examples is really about whether the tests are independent. Bayesianism, for example, is just a red herring here.

Consider trying the same intervention at 10 different hospitals, and all of them individually have an outcome of p=0.07 +/- 0.2 for the intervention to "work". In spite of several hospitals not meeting a significance threshold, that is very strong evidence that it does, in fact, work, and there are good statistical ways to handle this (e.g. regression over pooled data with a main effect and a hospital effect, or a multilevel model etc.). Tests that are highly correlated reinforce each other, and modeled correctly, that is what you see statistically. The analysis will give a credible interval or p-value or whatever you like that is much stronger than the p=0.05 results on the individual hospitals.

On the other hand, experiments that are independent do not reinforce each other. If you test 20 completely unrelated treatments, and one comes up p=0.05, you should be suspicious indeed. This is the setting of most multiple comparisons techniques.

Things are tougher in the intermediate case. In general, I like to try to use methods that directly model the correlations between treatments, but this isn't always trivial.

One thing I'm wondering: is it possible to retroactively correct for the failure to stratify or match by randomly creating matched sub-samples, and resampling multiple times? Or does that introduce other problems.

That's not too far from what they did, by trying to control for BP. It's reasonable, but it's still much better to stratify your sampling unless your sample is so big that it doesn't matter.

There's a robust literature on post-sampling matching and weighting to control for failures in confounder balance. The simplest case is exact or 1-1 matching, where subjects in the treated and control sets with identical confounder vectors are "paired off," resulting in more balanced groups. A common tool here is propensity scores, which let you quantify how similar a treated versus control subject is with many covariates, or continuous covariates.

These kinds of techniques do change the populations you can make inferences about. What commonly happens is that you end up losing or downweighting untreated subjects (think large observational studies where most people are "untreated"), and so your inference is really only applicable to the treated population. But there are ways around it of course. If you're interested here's a good overview: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/

I don't see why the constraints are necessary. If you don't screw up the randomization (meaning the group partitions are actually equiprobable), the random variables describing any property of any member of one group are exactly the same as for the other group. Therefore, if you use correct statistical procedures (e.g., Fisher's exact test on the 2x2 table: given vitamin D / not given vitamin D and died/alive), your final p-value should already account for the probability that, for example, every high-blood-pressure person got into one group. If, instead, you use constraints on some property during randomization, who knows what that property might be correlated with? Which would, imo, weaken the result.

Here's why: if you force equally weighted grouping into strata (or matched pairs), then if the property is correlated with the outcome, you'll have a *better* or now worse understanding of the effect than if it's randomly selected for. If it's uncorrelated with the outcome, it does nothing. But if you ignore the property, and it's correlated, you may get a lopsided assignment on that characteristic.

Sorry, that should read: "better (or no worse) understanding"

JASSCC pretty much got this right. It generally hurts nothing to enforce balance in the randomization across a few important variables, but reduces noise. That's why it's standard in so many trials. Consider stylized example of a trial you want to balance by sex. In an unbalanced trial, you would just flip a coin for everyone independently, and run the risk of a big imbalance happening by chance. Or, you can select a random man for the control group, then one for the treatment. Then select a random woman for control, then one for treatment, etc. Everything is still randomized, but all the populations are balanced and you have removed a key source of noise.
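That stylized sex-balanced scheme, in code (a sketch with made-up subject names):

```python
import random

def balanced_assign(subjects_by_sex, seed=0):
    """Within each sex, shuffle and then deal subjects alternately to
    control and treatment, so the two arms stay balanced on sex while
    every individual assignment is still random."""
    rng = random.Random(seed)
    control, treatment = [], []
    for sex, subjects in subjects_by_sex.items():
        pool = list(subjects)
        rng.shuffle(pool)  # random order within the stratum
        for i, s in enumerate(pool):
            (control if i % 2 == 0 else treatment).append((sex, s))
    return control, treatment

men = [f"m{i}" for i in range(10)]
women = [f"w{i}" for i in range(10)]
control, treatment = balanced_assign({"M": men, "F": women})
```

With ten of each sex, both arms end up with exactly five men and five women; independent coin flips would only give that split on average.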

On the 'what could go wrong' point - what could go wrong is that in ensuring that your observable characteristics are nicely balanced, you've imported an assumption about their relation to characteristics that you cannot observe - so you're saying the subset of possible draws in which observables are balanced is also the subset where unobservables are balanced, which is way stronger than your traditional conditional independence assumption.

I think I understand the problem with your approach to combining Bayes factors. You can only multiply them like that if they are conditionally independent (see https://en.wikipedia.org/wiki/Conditional_independence).

In this case, you're looking for P(E1 and E2|H), where E1, E2 are the results of your two experiments and H is your hypothesis.

Now, generally P(E1 and E2|H) != P(E1|H) * P(E2|H).

If you knew e.g. P(E1 | E2, H), you could calculate P(E1 and E2|H) = P(E1 | E2, H) * P(E2|H).
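A tiny numeric illustration of why the factors can't just be multiplied (a contrived setup where E2 is a verbatim copy of E1, i.e. maximally dependent):

```python
# H1: coin lands heads with probability 0.9; H0: fair coin.
# E1: "the flip was heads". E2: the SAME flip reported a second time,
# so E2 carries no information beyond E1.
p_e1_h1, p_e1_h0 = 0.9, 0.5

bf_single = p_e1_h1 / p_e1_h0   # Bayes factor from E1 alone: 1.8
bf_naive = bf_single ** 2       # wrongly treats E2 as independent: 3.24
# Correct: P(E1 and E2 | H) = P(E1 | H), since E2 is redundant,
# so the combined factor is still just 1.8.
bf_true = p_e1_h1 / p_e1_h0
```

The naive product overstates the evidence by a factor of 1.8 here; with ten "replications" of the same flip it would overstate it by 1.8^9.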

I think that some of the confusion here is on the difference between the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). The first is the probability under the null hypothesis of getting at least one false positive. The second is the expected proportion of false positives over all positives. Assuming some level of independent noise, we should expect the FWER to increase with the number of tests if the power of our test is kept constant (this is not difficult to prove in a number of settings using standard concentration inequalities). The FDR, however, we can better control. Intuitively this is because one "bad" apple will not ruin the barrel as the expectation will be relatively insensitive to a single test as the number of tests gets large.

Contrary to what Scott claims in the post, Holm-Bonferroni does *not* require independence of tests because its proof is a simple union bound. I saw in another comment that someone mentioned the Benjamini-Hochberg rule as an alternative. This *does* (sometimes) require independent tests (more on this below) and bounds the FDR instead of the FWER. One could use the Benjamini-Yekutieli rule (http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_yekutieli_ANNSTAT2001.pdf) that again bounds the FDR but does *not* require independent tests. In this case, however, this is likely not powerful enough as it bounds the FDR in general, even in the presence of negatively correlated hypotheses.

To expand on the Benjamini-Hochberg test, we actually do not need independence and a condition in the paper I linked above suffices (actually, a weaker condition of Lehmann suffices). Thus we *can* apply Benjamini-Hochberg to Scott's example, assuming that we have this positive dependency. Thus, suppose that we have 99 tests with a p-value of .04 and one with a p-value of .06. Then applying Benjamini-Hochberg would tell us that we can reject the first 80 tests with a FDR bounded by .05. This seems to match Scott's (and my) intuition that trying to test the same hypothesis in multiple different ways should not hurt our ability to measure an effect.

Sorry, I made a mistake with my math. I meant so say in the last paragraph that we would reject the first 99 tests under the BH adjustment and maintain a FDR of .05.
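The corrected count is easy to verify with a few lines (a sketch of the plain BH step-up rule):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest k such that
    p_(k) <= k * alpha / m, and reject the k smallest p-values.
    Returns the number of rejected hypotheses."""
    m = len(pvalues)
    k = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= i * alpha / m:
            k = i
    return k

# Scott's example: ninety-nine p = 0.04 and one p = 0.06.
n_rejected = benjamini_hochberg([0.04] * 99 + [0.06])
```

The 99th threshold is 99 * 0.05 / 100 = 0.0495 &gt;= 0.04, so all ninety-nine are rejected with FDR bounded by .05; the p = 0.06 test misses its threshold of 0.05.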

It's true that Bonferroni doesn't require independence. However, in the absence of independence it can be very conservative. Imagine if you ran 10 tests that were all almost perfectly correlated. You then use p < 0.005 as your corrected version of p < 0.05. You're still controlling the error rate – you will have no more than a 5% chance of a false positive. In fact, much less! But you will make many more type II errors, because the true p value threshold should be 0.05.

Yes this is true. First of all, the Holm-Bonferroni adjustment is uniformly more powerful than the Bonferroni correction and has no assumption on the dependence structure of the tests, so Bonferroni alone is always too conservative. Second of all, this is kind of the point: there is no free lunch. If you want a test that can take advantage of a positive dependence structure and not lose too much power, after a certain point, you are going to have to pay for it by either reducing the generality of the test or weakening the criterion used to define failure. There are some options that choose the former (like Sidak's method) and others that choose the latter, by bounding FDR instead of FWER (still others do both, like the BH method discussed above).

I agree with your post, and judging from where the p-values probably came from I agree they are probably positively dependent.

FYI there is a new version of the BH procedure which adjusts for known dependence structure. It doesn’t require positive dependence but is uniformly more powerful than BH under positive dependence:

https://arxiv.org/abs/2007.10438

The answer to this is Regularized Regression with Poststratification: see here -- https://statmodeling.stat.columbia.edu/2018/05/19/regularized-prediction-poststratification-generalization-mister-p/

I work in a big corp which makes money by showing ads. We run thousands of A/B tests yearly, and here are our standard ways to deal with such problems:

1. If you suspect beforehand that there will be multiple significant hypotheses, run the experiment with two control groups, i.e. an AAB experiment. Then disregard all alternatives which aren't significant in both comparisons A1B and A2B.

2. If you run an A/B experiment and have multiple significant hypotheses, rerun the experiment and only pay attention to hypotheses which were significant in the previous experiment.

I am not a statistician, so I'm unsure if this is formally correct.

I work in a small corp which makes money by showing ads. No, your methods are not formally correct. Specifically, for #1, you're just going halfway to an ordinary replication, and thus only "adjusting" the significance threshold by some factor which may-or-may-not be enough to account for the fact that you're doing multiple comparisons, and for #2, you're just running multiple comparisons twice, with the latter experiment having greater power for each hypothesis tested thanks to the previous adjusted-for-multiple-comparisons prior.

All that is to say- the math is complicated, and your two methods *strengthen* the conclusions, but they by no means ensure that you are reaching any particular significance level or likelihood ratio or whatever, since that would depend on how many hypotheses, how large the samples, and so on.

Thanks for the clarification. Do you have designated people who make decisions on A/B tests? If not, I would be really interested in hearing your standard practices for A/B testing.

Generally, the decisions are "pre-registered" - we've defined the conditions under which the test will succeed or fail, and we know what we'll do in response to either state, so no one is really making a decision during/after the test. All of the decisions come during test design - we do have a data scientist on staff who helps on larger or more complicated designs, but generally we're running quite simple A/B tests: a single defined goal metric, some form of randomization between control and test (which can get slightly complicated for reasons pointed out in the OP!), a power calculation done in advance to ensure the test is worth running, and an evaluation after some pre-defined period.

Usually we do replications only when A) requested, or B) enough time has passed that we suspect the underlying conditions might change the outcome, or C) the test period showed highly unusual behavior compared to "ordinary" periods.

We've been considering moving to a more overtly Bayesian approach to interpreting test results and deciding upon action thresholds (as opposed to significance-testing), but haven't done so yet.

Can I ask why the big corp doesn't employ a statistician/similar to do this? I'm not trying to be snarky, but if they're investing heavily in these experiments it seems weird to take such a loose approach.

It does employ several analysts who, as far as I know, have at least passable knowledge of statistics. As for why our A/B testing isn't formally correct, I have several hypotheses:

1. A lot of redundancy - we have literally thousands of metrics in each A/B test, so any reasonable hypothesis would involve at least 3-5 metrics - therefore we're unlikely to get a false positive. If, on the other hand, an A/B test shows non-significant results for something we believe in, we would often redo the experiment, which decreases the chance of a false negative. So from a Bayesian point of view we are kind of alright - we just inject some of our beliefs into the A/B tests.

2. It's really hard to detect statistical problems in human decisions (all A/B tests have human-written verdicts) because the majority of human decisions have multiple justifications. Furthermore, it's even harder to calculate the damage from subtly wrong decision-making - would we have redone the experiment if the results had been a bit different? Did it cost us anything if we mistakenly approved design A, which isn't really different from design B? A lot of legibility problems here.

3. Deployment of analysts is very skewed by department - I know that some departments have 2-3x the number of analysts we have. Maybe in such departments the culture of experimentation is better.

>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant. But I can construct another common-sensically significant version that it wouldn't find significant - in fact, I think all you need to do is have ninety-nine experiments come back p = 0.04 and one come back 0.06.

About the Holm-Bonferroni method:

How it works is that you order the p-values from smallest to largest, and then compute a threshold for significance for each position in the ranking. The threshold formula is: α / (number of tests – rank + 1), where α is typically 0.05.

Then the p-values are compared to the threshold, in order. If the p-value is less than the threshold the null hypothesis is rejected. As soon as one is above the threshold, that one, and all subsequent p-values in the list, fail to reject the null hypothesis.

So for your example of 100 tests where one is 0.06 and others are all 0.04, it would come out to:

Rank p Threshold

1 0.04 0.05 / 100 = 0.0005

2 0.04 0.05 / 99 = 0.00051

....

100 0.06 0.05 / 1 = 0.05

So you're right, none of those would be considered "significant". But you'd have to be in some pretty weird circumstances to have almost all your p-values be 0.04.

The big concept here, is that this protocol controls the *familywise error rate*, which is the probability that at least one of your rejections of the null is incorrect. So it makes perfect sense that as you perform more tests, each test has to pass a stricter threshold.

What you are actually looking for is the *false discovery rate* which is the expected fraction of your rejections of the null that are incorrect. There are other formulas to calculate this. https://en.wikipedia.org/wiki/False_discovery_rate
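For contrast, here's a minimal sketch of the Benjamini-Hochberg step-up procedure, the simplest FDR-controlling method (it assumes independent or positively dependent tests):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Step-up Benjamini-Hochberg: find the largest rank k with
    p_(k) <= (k / m) * q, then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * q:
            k_max = rank                  # largest passing rank seen so far
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject

# Scott's example: ninety-nine p = 0.04 and one p = 0.06. FDR control rejects
# the 99 nulls at p = 0.04 (since 0.04 <= (99/100) * 0.05) but not the 0.06.
print(sum(benjamini_hochberg([0.04] * 99 + [0.06])))  # 99
```

Note how differently this behaves from the familywise-error procedure above on the exact same inputs: 99 rejections instead of zero.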

Yes, exactly. I made that point above with respect to what factors go into the decision on how to adjust P-values for multiple comparisons.

I also forgot to mention that these multiple-hypothesis correction formulas assume the experiments are **independent**. This assumption would be violated in your thought experiment of measuring the same thing in 100 different ways.

They actually just assume positive dependence, not independence. For example, if the family of tests is jointly Gaussian and the tests are all nonnegatively correlated, then Theorem 1.2 and Section 3.1, case 1 of http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_yekutieli_ANNSTAT2001.pdf imply that we can still use Benjamini-Hochberg and bound the FDR. In particular, this should hold if the tests are all measuring the same thing, as in Scott's example.

Cool, I didn't know that it still held under positive dependence

"by analogy, suppose you were studying whether exercise prevented lung cancer. You tried very hard to randomize your two groups, but it turned out by freak coincidence the "exercise" group was 100% nonsmokers, and the "no exercise" group was 100% smokers."

But don't the traditionalists say that this is a feature, not a bug, of randomization? That if unlikely patterns appear through random assignment, this merely mirrors the potential for such seemingly nonrandom groupings in real life? This is obviously a very extreme example for argumentative purposes, but I've heard people who are really informed about statistics (unlike me) say that when you get unexpected patterns from genuine randomization, hey, that's randomization.

Isn’t that the idea of Poisson clumping?

About p thresholds, you've pretty much nailed it by saying that simple division works only if the tests are independent. And that is pretty much the same reason why the 1:1 likelihood ratio can't be simply multiplied by the others and give the posterior odds. This works only if the evidence you get from the different questions is independent (see Jaynes's PT:LoS chap. 4 for reference)

re: Should they reroll their randomization

What if there was a standard that after you randomize, you try to predict as well as possible which group is more likely to naturally perform better, and then you make *that* the treatment group? Still flawed, but feels like a way to avoid multiple rolls while also minimizing the chance of a false positive (of course assuming avoiding false positives is more important than avoiding false negatives).

Did you get this backwards? It seems like if you're trying to minimize false positives, you'd make the *control group* the one with the better "natural" performance. That said, in most of these sorts of trials, the relevance of many of the potential confounders is unclear.

Yep... Thanks for pointing that out, flip what I said.

Re:unclear: I guess I'm picturing that if people are gonna complain after the fact, you'd hope it would have been clear before the fact? If people are just trying to complain then that doesn't work, but if we assume good faith it seems plausible? Maybe you could make it part of preregistration (the kind where you get peer reviewed before you run the study), where you describe the algorithm you'll use to pick which group should be the control group, and the reviewers can decide if they agree with the weightings. Once you get into territory where it isn't clear, I'd hope the after-the-fact-complaints should be fairly mild?

The Bonferroni correction has always bugged me philosophically for reasons vaguely similar to all this. Merely reslicing and dicing the data ought not, I don’t think, ruin the chances for any one result to be significant. But then again I believe what the p-hacking people all complain about, so maybe we should just agree that 1/20 chance is too common to be treated as unimpeachable scientific truth!

Hm. I think to avoid the first problem, divide your sample into four groups and do the experiment twice, two groups at a time. If you check for 100 confounders, you get an average of 5 significant imbalances in each experiment, but an average of only 0.25 confounders imbalanced in both, so with any luck you can get the statistics to tell you whether any of the confounders made a difference (although if you didn't increase N, you might not have a large enough sample in each half).

When Scott first mentioned the Cordoba study (https://astralcodexten.substack.com/p/covidvitamin-d-much-more-than-you) I commented (https://astralcodexten.substack.com/p/covidvitamin-d-much-more-than-you#comment-1279684) that it seemed suspect because some of the authors of that study, Gomez and Boullion, were also involved in another Spanish Vitamin D study several months later that had major randomization issues (see https://twitter.com/fperrywilson/status/1360944814271979523?s=20 for explanation). Now, The Lancet has removed the later study and is investigating it: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3771318.

The Bayesian analysis only really works if you have multiple *independent* tests of the same hypothesis. For example, if you ran your four tests using different survey populations it might be reasonable. However, since you used the same survey results you should expect the results to be correlated and thus multiplying the confidence levels is invalid.

As an extreme version of this, suppose that you just ran the *same* test 10 times, and got a 2:1 confidence level. You cannot conclude that this is actually 1000:1 just by repeating the same numbers 10 times.

Combining data from multiple experiments testing the same hypothesis to generate a more powerful test is a standard meta-analysis. Assuming the experiments are sufficiently similar, you effectively paste all the data sets together and then calculate the p-value for the combined data set. With a bit of maths you can do this using only the effect sizes (d) and standard errors (s) from each experiment (you create 1/s^2 copies of d for each experiment and then compute the p-value).

The reason none of the other commenters have suggested this as a solution to your ambidexterity problem (being ambidextrous isn't a problem! Ahem, anyway) is that you haven't given enough data - just the p-values instead of the effect sizes and standard errors. I tried to get these from the link on the ambidexterity post to the survey, but it gave me separate pretty graphs instead of linked data I could run the analysis on.

However, I can still help by making an assumption: assuming the standard error is the same (s) across all 4 questions, since they come from the same questionnaire and therefore likely had similar numbers of responders, we can convert the p-values into effect sizes - the differences (d) - using the inverse normal function:

1. p = 0.049 => d = 1.65s

2. p = 0.008 => d = 2.41s

3. p = 0.48 => d = 0.05s

4. p = 0.052 => d = 1.63s

We then average these to get the combined effect size d_c = 1.43s. However, all the extra data has reduced the standard error of our combined data set. We're assuming all experiments had the same error, so this is just like taking the average of 4 data points from the same distribution - i.e. the standard error is divided by sqrt(n). With 4 experiments, the combined error (s_c) is half what we started with. Our combined measure is therefore 1.43s / 0.5s = 2.86 standard deviations above the mean => a combined p-value of 0.002.

Now this whole analysis depends on us asserting that these 4 experiments are basically repeats of the same experiment. Under that assumption, you should be very sure of your effect!

If the different experiments had different errors, we would take a weighted average, with the lower-error experiments getting higher weights (1/s^2 - the inverse of the squared standard error, which gets us back to the number of data points from that experiment that we're pasting into our 'giant meta data set'), and similarly compute a weighted-average standard error (sqrt(1/sum(1/s^2))).
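The equal-error calculation above is easy to check numerically; a quick sketch (using the same one-sided p-to-z conversion):

```python
from scipy.stats import norm

p_values = [0.049, 0.008, 0.48, 0.052]

# One-sided p-values -> effect sizes in units of the common standard error s
z_scores = [norm.isf(p) for p in p_values]   # roughly 1.65, 2.41, 0.05, 1.63

# With equal errors, the combined estimate is the plain average, and the
# combined standard error shrinks by sqrt(4) = 2
d_c = sum(z_scores) / len(z_scores)
z_combined = d_c * len(z_scores) ** 0.5
p_combined = norm.sf(z_combined)

print(round(z_combined, 2), round(p_combined, 3))  # ~2.87 and 0.002
```

(The 2.86 in the comment comes from rounding d_c to 1.43 before doubling; carrying full precision gives about 2.87, with the same combined p-value of roughly 0.002.)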

This should only be valid if the four tests are independent of each other. Since they are using the same sample, this analysis is probably invalid.

Blast. You're quite right. This problem is called pseudoreplication. My limited understanding is that our options are:

1. Average the answers people gave to each of the 4 questions to create one, hopefully less noisy, score of 'authoritarian-ness' for each real replicate (independent person answering the questionnaire), and then run the standard significance test on that.

2. Develop a hierarchical model which describes the form of the dependence between the different answers, and then perform a marginally more powerful analysis on this.

Yes. But Scott should be able to estimate the covariance: I am guessing these 4 p-values are all coming from two sample t-tests on four correlated responses. The correlations between the differences between the four means should be* the same as the individual-level within-sample correlations of the four responses, which Scott can calculate with access to the individual-level data.

* depending on modeling assumptions

Interesting....

I mean the right thing to do is probably to combine all four tests into one big test. Perhaps the most general thing you could do is to treat the vector of empirical means as a sample from a 4-d Gaussian (that you could approximate the covariance of), and run some kind of multi-dimensional T-test on.

Do you know what the right test statistic would be though? If your test statistic is something like the sum of the means I think your test would end up being basically equivalent to Neil's option 1. If your test statistic was the maximum of the component means, I guess you could produce the appropriate test, but it'd be complicated. Assuming that you had the covariance matrix correct (doing an analogue of a z-test rather than a t-test), you'd have to perform a complicated integral in order to determine the probability that a Gaussian with that covariance matrix produces a coordinate whose mean is X larger than the true mean. If you wanted to do the correct t-test analogue, I'm not even sure how to do it.

Monte Carlo should be a good solution for either thing you want to do. Nonparametric bootstrap is also probably a bit better than the Gaussian approximation.

I don’t think it will be roughly equivalent to treating them as independent. The average of 4 correlated sample means may have a much larger variance than the average of 4 independent ones.

Not equivalent to treating them as independent. Equivalent to Neil's revision where he talks about averaging scores.

I don't think Monte Carlo works for our t-test analogue (though it works to perform the integral for the z-test). The problem is that we want to know that *whatever* the ground truth correlation matrix, we don't reject with probability more than 0.05 if the true mean is 0. Now for any given correlation matrix, we could test this empirically with Monte Carlo, but finding the worst case correlation matrix might be tricky.

Ah, I might not be following all the different proposals; sorry if I missed something.

Unfortunately with global mean testing there isn’t a single best choice of test stat, but in this problem “average of the 4 se-standardized mean shifts” seems as reasonable a combined test stat as anything else.

Regarding “what is the worst case correlation matrix” I was assuming Scott would just estimate the correlation matrix from the individual level data and plug it in as an asymptotically consistent estimate. Whether or not that’s reasonable depends on the sample size, which I don’t know.

If the sample size isn’t “plenty big” then some version of the bootstrap (say, iid but stratified by ambi and non-ambi) should also work pretty well to estimate the standard error of whatever test statistic he wants to use. Have to set it up the right way, but that is a bit technical for a reply thread.

Permutation test will also work in small samples as a last resort but would not give specific inference about mean shifts.

I wouldn't worry too much about the vitamin D study.

For one thing, it is perfectly statistically correct to run the study without doing all of these extra tests, and as you point out, throwing out the study if any of these tests comes back with p-value less than 0.05 would basically mean the study can never be done.

Part of the point of randomizing is that you would expect any confounding factors to average out. And sure, you got unlucky and people with high blood pressure ended up unevenly distributed. On the other hand, previous lung disease and previous cardiovascular disease ended up pretty unbalanced in the opposite direction (p very close to 1). If you run these numbers over many different possible confounders, you'd expect the effects to roughly average out.

I feel like this kind of analysis is only really useful to

A) Make sure that your randomization isn't broken somehow.

B) If the study is later proved to be wrong, it helps you investigate why it might have been wrong.

"I think the problem is that these corrections are for independent hypotheses, and I'm talking about testing the same hypothesis multiple ways (where each way adds some noise). "

If you are testing the same hypothesis multiple ways, then your four variables should be correlated. In that case you can extract one composite variable from the four with a factor analysis, et voilà - just one test to perform!

"Maybe I need a real hypothesis, like "there will be a difference of 5%", and then compare how that vs. the null does on each test? But now we're getting a lot more complicated than just the "call your NHST result a Bayes factor, it'll be fine!" I was promised."

Ideally what you need is a prior distribution over possible hypotheses. So, for example, you might think before any analysis of the data that there is a 90% chance of no effect, and that if there is an effect you expect its size to be normally distributed around 0 with an SD of 1. Then your prior distribution on the effect size x is f(x) = 0.9·δ(0) + 0.1·N(0,1), and if p(x) is the probability of your observation given an effect size of x, you can calculate the posterior distribution of the effect size as f(x)p(x) / ∫f(x)p(x)dx.
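A numeric sketch of that posterior update, with made-up numbers (an estimated effect of 2.0 with standard error 1.0): because the prior is a point mass plus a normal slab, both marginal likelihoods have closed forms.

```python
from math import sqrt
from scipy.stats import norm

prior_null, prior_alt = 0.9, 0.1   # 90% chance of no effect, 10% slab N(0, 1)
obs, se = 2.0, 1.0                 # hypothetical observed effect and its error

# Marginal likelihood under the null: observation ~ N(0, se)
m_null = norm.pdf(obs, loc=0, scale=se)
# Under the slab: effect ~ N(0,1) plus noise ~ N(0,se) => obs ~ N(0, sqrt(1+se^2))
m_alt = norm.pdf(obs, loc=0, scale=sqrt(1 + se**2))

posterior_null = prior_null * m_null / (prior_null * m_null + prior_alt * m_alt)
print(round(posterior_null, 3))  # ~0.824: a z = 2 result still favors "no effect"
```

With a 90% spike at zero, even a conventionally "significant" z = 2 observation leaves the no-effect hypothesis more likely than not, which is exactly the gap between p-values and posterior beliefs being discussed here.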

If the express goal of the "randomization" is that the resulting groups be almost equal on every conceivable axis, and any deviation from equality easily invalidates any conclusion you might want to draw... is randomization maybe the worst tool to use here?

Would a constraint-solving algorithm not fare much, much better - fed the data, dividing it 10,000 different ways, and then randomizing among the few divisions that are closest to equal on every axis?

I hear your cries: malpractice, (un-)conscious tampering, incomplete data, bugs, ... But a selection process that reliably kills a significant portion of all research despite best efforts is hugely wasteful.

How large is that portion? Cut each bell curve (one per axis) down to the acceptable peak in the middle (= equal enough along that axis) and take that to the power of the number of axes. That's a lot of avoidable, potentially useless studies!

Clinical trial patients arrive sequentially or in small batches; it's not practical to gather up all of the participants and then use an algorithm to balance them all at once.

https://maxkasy.github.io/home/files/papers/experimentaldesign.pdf

This paper discusses more or less this issue with potentially imbalanced randomization from a Bayesian decision theory perspective. The key point is that, for a researcher that wants to minimize expected loss (as in, you get payoff of 0 if you draw the right conclusion, and -1 if you draw the wrong one), there is, in general, one optimal allocation of units to treatment and control. Randomization only guarantees groups are identical in expected value ex-ante, not ex-post.

You don't want to do your randomization and find that you are in the 1% state of world where there is wild imbalance. Like you said, if you keep re-doing your randomization until everything looks balanced, that's not quite random after all. This paper says you should bite the bullet and just find the best possible treatment assignment based on characteristics you can observe and some prior about how they relate to the outcome you care about. Once you have the best possible balance between two groups, you can flip a coin to decide what is the control and what is the treatment. What matters is not that the assignment was random per se, but that it is unrelated to potential outcomes.

I don't know a lot about the details of medical RCTs, but in economics it's quite common to do stratified randomization, where you separate units in groups with similar characteristics and randomize within the strata. This is essentially taking that to the logical conclusion.

1. P-values should not be presented for baseline variables in an RCT. It is just plain illogical. We are randomly drawing two groups from the same population. How could there be a systematic difference? In the words of Doug Altman: ”Performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance.”

In this specific study, however, I'm not at all sure that the randomization really was that random. The vitamin D crowd in Spain seems to be up to some shady stuff: https://retractionwatch.com/2021/02/19/widely-shared-vitamin-d-covid-19-preprint-removed-from-lancet-server/. My 5c: https://twitter.com/jaralaus/status/1303666136261832707?s=21

Regardless, it is wrong to use p-values to choose which variables to put in the model. It is really very straightforward: just decide a priori which variables are known (or highly likely) predictors and put them in the model. Stephen Senn: ”Identify useful prognostic covariates before unblinding the data. Say you will adjust for them in the statistical analysis plan. Then do so.” (https://www.appliedclinicaltrialsonline.com/view/well-adjusted-statistician-analysis-covariance-explained)

2. As others have pointed out, Bayes factors will quantify the evidence the data provide for two competing hypotheses - commonly a point null hypothesis and a directional alternative (”the effect is larger/smaller than 0”). A ”negative” result would not be 1:1; that is a formidably inconclusive result. Negative would be e.g. 1:10, vs. positive 10:1.

1. Exactly. The tests reported in Scott's first table above are of hypotheses that there are differences between *population* treatment groups. They only make sense if we randomized the entire population and then sampled from these groups for our study. Which we don't. I typically report standardized differences for each factor, and leave it at that.

Regarding part 2, when your tests are positively correlated, there are clever tricks you can do with permutation tests. You resample the data, randomly permute your dependent variable, run your tests, and collect the p-values. If you do this many times, you get a distribution of the p-values under the null hypothesis. You can then compare your actual p-values to this. Typically, for multiple tests, you use the maximum of a set of statistics. The reference is Westfall, P.H. and Young, S.S., 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment (Vol. 279). John Wiley & Sons.
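A minimal sketch of that resampling scheme on simulated data (using the minimum p-value across tests per permutation, the "minP" variant; all numbers here are made up for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated data: a binary group label and 4 endpoints per person
labels = np.repeat([0, 1], 100)
effects = np.array([0.5, 0.4, 0.0, 0.1])          # hypothetical true shifts
endpoints = rng.normal(size=(200, 4)) + labels[:, None] * effects

observed_p = np.array([ttest_ind(endpoints[labels == 1, j],
                                 endpoints[labels == 0, j]).pvalue
                       for j in range(4)])

# Permute the group label; record the minimum p-value over the 4 tests each time
min_p = np.array([
    min(ttest_ind(endpoints[perm == 1, j], endpoints[perm == 0, j]).pvalue
        for j in range(4))
    for perm in (rng.permutation(labels) for _ in range(1000))
])

# Adjusted p-value: how often the null's luckiest test beats each observed p
adjusted_p = np.array([(min_p <= p).mean() for p in observed_p])
print(adjusted_p)  # multiplicity-adjusted, while respecting the correlations
```

Because each permutation keeps the rows intact, the null distribution of the minimum p-value automatically reflects however correlated the endpoints are, which is exactly what generic Bonferroni-style corrections miss.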

This is the kind of approach I would take for tests that I expect to be correlated. Or possibly attempt to model the correlation structure among the endpoints, if I wanted to do something Bayesian.

Agreed with Westfall&Young minP approach.

You can't run a z-test and treat "fail to reject the null" as the same as "both are equivalent". The problem with that study is that they didn't have nearly enough power to say that the groups were "not significantly different", and relying on a statistical cutoff for each of those comparisons instead of thinking through the actual problems makes this paper useless, in my view.

Worst example: If you have 8 total patients with diabetes (5 in the smaller group and 3 in the larger) you are saying that a 3.5 fold difference in diabetes incidence rate (accounting for the group sizes) is not significant. Obviously that is the kind of thing that can only happen if you are relying on p-values to do your thinking for you, as no reasonable person would consider those groups equivalent. It's extra problematic because, coincidentally, all of those errors happened to point in the same direction (more comorbidities in the control group). This is more of the "shit happens" category of problem though, rather than a flaw in the study design.

There are lots of ways they could have accounted for differences in the study populations, which I expect they tried and it erased the effect. Regressing on a single variable (blood pressure) doesn't count... you would need to include all of them in your regression. The only reason this passed review IMO is because it was "randomized" and ideally that should take care of this kind of issue. But for this study (and many small studies) randomization won't be enough.

I put this paper in the same category of papers that created the replication crisis in social psychology: "following all the rules" and pretending that any effect you find is meaningful as long as it crosses (or in this case, doesn't cross) the magical p=0.05 threshold. The best response to such a paper should be something like "well that is almost certainly nothing, but I'd be slightly more interested in seeing a follow-up study"

You need to specify in advance which characteristics you care about ensuring proper randomization and then do block randomization on those characteristics.

There are limits to the number of characteristics you can do this for, with a given sample size / effect size power calculation.

This logic actually says that you *can* re-roll the randomization if you get a bad one --- in fact, it says you *must* do this, that certain "randomizations" are much better than other ones because they ensure balance on characteristics that you're sure you want balance on.

>Metacelsus mentions the Holmes-Bonferroni method. If I’m understanding it correctly, it would find the ten-times-replicated experiment above significant.

Unfortunately, the Holm-Bonferroni method doesn't work that way. It always requires that at least one of the p-values be significant in the multiple-comparisons sense; that is, at least one p-value less than 0.05/n, where n is the number of comparisons.

Its strength is that it doesn't require all the p-values to be that low. So if you have four p-values: 0.005, 0.01, 0.02, 0.04, then all four are significant under Holm-Bonferroni (the successive thresholds are 0.0125, 0.0167, 0.025, and 0.05), whereas only the first two would pass the standard multiple-comparisons cutoff of 0.05/4 = 0.0125.

I don't claim to have any serious statistical knowledge here, but my intuitive answer is that expected evidence should be conserved.

If you believe that vitamin D reduces COVID deaths, you should expect to see a reduction in deaths in the overall group. It should be statistically significant, but that's effectively a way of saying that you should be pretty sure you're really seeing it.

If you expect there to be an overall difference, then either you expect that you should see it in most ways you could slice up the data, or you expect that the effect will be not-clearly-present in some groups but very-clearly-present in others, so that there's a clear effect overall. I think the latter case means _something_ like "some subgroups will not be significant at p < .05, but other subgroups will be significant at p < (.05 / number of subgroups)". If you pick subgroups _after_ seeing the data, your statistical analysis no longer reflects the expectation you had before doing the test.

For questions like "does ambidexterity reduce authoritarianism", you're not picking a single metric and dividing it across groups - you're picking different ways to operationalize a vague hypothesis and looking at each of them on the same group. But I think that the logic here is basically the same: if your hypothesis is about an effect on "authoritarianism", and you think that all the things you're measuring stem from or are aspects of "authoritarianism", you should either expect that you'll see an effect on each one (e.g. p = .04 on each of four measures), or that one of them will show a strong enough effect that you'll still be right about the overall impact (e.g. p = .01 on one of four measures).

For people who are giving more mathematical answers: does this intuitive description match the logic of the statistical techniques?

"Should they have checked for this right after randomizing, noticed the problem, and re-rolled their randomization to avoid it? I've never seen anyone discuss this point before. The purist in me is screaming no - if you re-roll your randomization on certain results, then it's not really random anymore, is it? But it seems harsh to force them to perform a study even though we know we'll dismiss the results as soon as we get them. If we made them check a pre-written list of confounders and re-roll until there were no significant differences on any of them, what could go wrong? I don't have a good answer to this question, but thinking about it still creeps me out."

The orthodox solution here is stratified random sampling, and it is fairly similar. For example, you might have a list of 2,500 men and 2,500 women (assume no NBs). You want to sample 500, 250 for control and 250 for intervention, and you expect that gender might be a confound. In stratified random sampling, instead of sampling 250 of each and shrugging if there's a gender difference, you choose 125 men for the control and 125 men for the intervention, and do the same for women. This way you are certain to get a balanced sample (https://www.investopedia.com/terms/stratified_random_sampling.asp, for example). While this only works with categorical data, you can always just bin continuous data until it cooperates.

The procedure is statistically sound, well-validated, and commonly practiced.
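A sketch of that procedure in code (a hypothetical helper; it assumes the stratifying variable is already categorical or binned):

```python
import random

def stratified_assignment(subjects, key, seed=0):
    """Randomly split subjects into control/treatment, balanced within each stratum."""
    rng = random.Random(seed)
    control, treatment = [], []
    strata = {}
    for s in subjects:                      # group subjects by stratum label
        strata.setdefault(key(s), []).append(s)
    for members in strata.values():
        rng.shuffle(members)                # random assignment within the stratum
        half = len(members) // 2
        control.extend(members[:half])
        treatment.extend(members[half:])
    return control, treatment

# 2,500 men and 2,500 women; each arm gets exactly 1,250 of each gender
subjects = [{"id": i, "gender": "M" if i < 2500 else "F"} for i in range(5000)]
control, treatment = stratified_assignment(subjects, key=lambda s: s["gender"])
print(len(control), len(treatment))  # 2500 2500, balanced by construction
```

Assignment is still random within each stratum, so the usual statistical guarantees survive, but a gender imbalance is now impossible rather than merely unlikely.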

I know of one randomized experiment which matched people into pairs which were as similar as possible, then for each pair chose randomly which was treatment and which was control.

>I chose four questions that I thought were related to authoritarianism

What you can do if you're trying to measure a single thing (authoritarianism) and have multiple proxies, is to average the proxies to get a single measure, then calculate the p-value using the average measure only. I'd recommend doing that now (even though, ideally, you'd have committed to doing that ahead of time).

For the Vitamin D study, I think you are off track, as were the investigators. Trying to assess the randomness of assignment ex post is pretty unhelpful. The proper approach is to identify important confounders ahead of time and use blocking. For example, suppose that blood pressure is an important confounder. If your total n is 160, you might divide those 160 into four blocks of 40 each, based on blood pressure range. A very high block, a high block, a low block, and a very low block. Then randomly assign half within each block to get the treatment, so 20 very high blood pressure folks get the vitamin D and 20 do not. That way you know that blood pressure variation across subjects won't mess up the study. If there are other important confounders, you have to subdivide further. If there are so many important confounders that the blocks get to be too small to have any power, then you needed a much bigger sample in the first place.

You would want to have a data-driven method to identify the confounders that you will use to block. Otherwise, you are embedding your intuition into the process.

[Comment deleted, Apr 6, 2021]

I don't disagree, but that means you do two sets of analysis: one to identify the confounders and another to test the hypothesis (unless I have misunderstood). Reminiscent of the training / validation approach used in ML, don't you think?

But you do not learn about the possible confounders by doing the study. You learn about them ahead of time from other research, e.g. research showing that blood pressure interacts strongly with COVID. Blocking can be data-driven in the sense that you have data from other studies showing that blood pressure matters. If you need to do a study to learn what the confounders are, then you need to do another study after that, based on that information.

And, of course, the larger the sample, the less you have to use advanced statistics to detect an effect.

Another alternative: replicate your results. In the first test, be sloppy, don't correct for multiple comparisons, just see what the data seem to say, and formulate your hypothesis(es). Then test that/those hypothesis(es) rigorously with the second test.

You can even break your single database into two random pieces, and use one to formulate the hypothesis(es) and the other to test it/them.
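The split-half idea is only a few lines of code (a hypothetical sketch; the exploration half is for generating hypotheses, the confirmation half only for the tests you committed to):

```python
import random

def explore_confirm_split(rows, seed=0):
    """Randomly split a dataset in half: explore on one half, confirm on the other."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

explore, confirm = explore_confirm_split(range(1000))
# Slice and dice `explore` freely; then run only the surviving hypotheses,
# with proper corrections, on `confirm`.
```

The random shuffle matters: splitting on anything systematic (say, survey submission time) could smuggle a confounder into one half.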

Significance doesn't seem like the right test here. When you test for significance, you are asking how likely it is that differences in the sample represent differences in the wider population (more or less; technically, in frequentist statistics you are asking "if I drew a sample and this variable were random, what is the chance I would see a difference at least this large?"). In this case we don't care about that question; we care whether the actual difference between the two groups is large enough for something correlated with it to show through. At the very least, the Bonferroni adjustment doesn't apply - in fact, I would go in the other direction. The difference needs to be big enough that it has a correlation with the outcome strong enough to cause a spurious result.

On section I...

The key point of randomized trials is NOT that they ensure balance of each possible covariate. The key point is that they make the combined effect of all imbalances zero in expectation, and they allow the statistician to estimate the variance of their treatment-effect estimator.

Put another way, randomization launders what would be BIAS (systematic error due to imbalance) into mere VARIANCE (random error due to imbalance). That does not make balance irrelevant, but it subtly changes why we want balance — to minimize variance, not because of worries about bias. If we have a load of imbalance after randomizing, we'll simply get a noisier treatment-effect estimate.

"If the groups are different to start with, then we won't be able to tell if the Vitamin D did anything or if it was just the pre-existing difference."

Mmmmmaybe. If the groups are different to start with, you get a noisier treatment-effect estimate, which MIGHT be so noisy that you can't reject the vitamin-D-did-nothing hypothesis. Or, if the covariates are irrelevant, they don't matter and everything turns out fine. Or, if vitamin D is JUST THAT AWESOME, the treatment effect will swamp the imbalance's net effect anyway. You can just run the statistics and see what numbers pop out at the end.

"Or to put it another way - perhaps correcting for multiple comparisons proves that nobody screwed up the randomization of this study; there wasn't malfeasance involved. But that's only of interest to the Cordoba Hospital HR department when deciding whether to fire the investigators."

No. It's of interest to us because if we decide that the randomization was defective, all bets are off; we can't trust how the study was reported and we don't know how the investigators might have (even if accidentally) put their thumb on the scale. If we instead convince ourselves that the randomization was OK and the study run as claimed, we're good to apply our usual statistical machinery for RCTs, imbalance or no.

"But this raises a bigger issue - every randomized trial will have this problem. [...] Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; [...] if you're not going to adjust these away and ignore them, don't you have to throw out every study?"

No. By not adjusting, you don't irretrievably damage your randomized trials, you just expand the standard errors of your final results.

Basically, if you really and truly think that a trial was properly randomized, all an imbalance does is bleed the trial of some of its statistical power.

See https://twitter.com/ADAlthousePhD/status/1172649236795539457 for a Twitter thread/argument with some interesting references.

Regarding whether hypertension explains away the results: I haven't read the papers (maybe Jungreis/Kellis did something strictly better), but here's a simple calculation that I think sheds some light.

So 11 out of the 50 treated patients had hypertension; 39 didn't.

And 15 out of the 26 control patients had hypertension; 11 didn't.

You know that a total of 1 vitamin D patient was ICU'd, and 13 of the control patients were admitted.

There is no way to slice this such that the treatment effect disappears completely [I say this, having not done the calculation I have in mind to check it -- will post this regardless of what comes out of the calculation, in the interest of pre-registering and all that]

To check this, let's imagine that you were doing a stratified study, where you're testing the following 2 hypotheses simultaneously:

H1: Vitamin D reduces ICU rate among hypertension patients

H2: Vitamin D reduces ICU rate among non-hypertension patients.

Your statistical procedure is to

(i) conduct a Fisher exact test on the 11 [treated, hypertension] vs 15 [control, hypertension] patients

(ii) conduct a Fisher exact test on the 39 [treated, no-hypertension] vs 11 [control, no-hypertension] patients

(iii) multiply both by 2, to get the Bonferroni-corrected p-values; accept a hypothesis if its Bonferroni-corrected p-value is < 0.05

If we go through all possible splits* of the 1 ICU treated patient into the 11+39 hypertension+non-hypertension patients and the 13 ICU control patients into the 15+11 hypertension+non-hypertension patients (there are 24 total possible splits), the worst possible split for the "Vitamin D works" camp is if 1/11 hypertension & 0/39 non-hypertension treated patients were ICU, and 10/15 hypertension & 3/11 non-hypertension control patients were ICU.

Even in this case you still have a significant treatment effect (the *corrected* p-values for H1 and H2 are 0.01 and 0.02).
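A minimal self-contained version of this check (my own sketch, stdlib only, so it may differ in details from the linked pastebin script; I'm assuming one-sided Fisher exact tests, which reproduces the quoted corrected p-values):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]
    (rows: treated/control, cols: ICU/not-ICU): the hypergeometric
    probability, conditional on all margins, of a treated-ICU count <= a."""
    n1, n2 = a + b, c + d        # group sizes
    k = a + c                    # total ICU admissions
    denom = comb(n1 + n2, k)
    return sum(comb(n1, x) * comb(n2, k - x) for x in range(a + 1)) / denom

# Worst split for the "Vitamin D works" camp:
p_hyp = fisher_one_sided(1, 10, 10, 5)   # hypertension: 1/11 vs 10/15 ICU
p_non = fisher_one_sided(0, 39, 3, 8)    # no hypertension: 0/39 vs 3/11 ICU

# Bonferroni-corrected p-values (x2 for the two hypotheses):
print(round(2 * p_hyp, 2), round(2 * p_non, 2))  # -> 0.01 0.02
```

Both corrected p-values clear the 0.05 bar even under the least favorable allocation of ICU patients across strata, which is the point of the exercise.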

I don't know how kosher this is formally, but it seems like a rather straightforward & conservative way to see whether the effect still stands (and not a desperate attempt to wring significance out of a small study, hopefully), and it does seem to stand.

This should also naturally factor out the direct effect that hypertension may have on ICU admission (which seems to be a big concern).

Other kinds of uncertainty might screw things up though - https://xkcd.com/2440/ -- given the numbers, I really do think there *has* to have been some issue of this kind to explain away these results.

*simple python code for this: https://pastebin.com/wCpPaxs8

Seems like it should be relevant that sunlight, which drives vitamin D production, also reduces blood pressure through nitric oxide.

https://www.heart.org/en/news/2020/02/28/could-sunshine-lower-blood-pressure-study-offers-enlightenment

My suspicion is that you are reaching the limits of what is possible using statistical inference. There might be, however, alternative mathematical approaches that might provide an answer to the real question "should we prescribe X and, if so, to whom?". I refer specifically to optimisation-based robust classification / regression (see e.g. work by MIT professor Bertsimas https://www.mit.edu/~dbertsim/papers.html#MachineLearning - he's also written a book on this). But I would still worry about the sample size, it feels small to me.

For the ambidexterity question, what was your expected relation between those four questions and the hidden authoritarianism variable? Did you expect them all to move together? All move separately but the more someone got "wrong" the more authoritarian they lean? Were you expecting some of them to be strongly correlated and a few of them to be weakly correlated? All that's to ask: is one strong sub-result and three weak sub-results a "success" or a "failure" of this prediction? Without a structural theory it's hard to know what to make of any correlations.

--------------

Then, just to throw one more wrench in, suppose your background doesn't just change your views on authoritarianism, but also how likely you are to be ambidextrous.

Historically, in much of Asia and Europe teachers enforced a heavy bias against left-handedness in activities like writing. [https://en.wikipedia.org/wiki/Handedness#Negative_connotations_and_discrimination] You don't see that bias exhibited as much by Jews or Arabs [anecdotal], probably because Hebrew and Arabic are written right-to-left. But does an anti-left-handed bias decrease ambidexterity (by shifting ambidextrous people to right-handed) or increase ambidexterity (by shifting left-handed people to ambidextrous)? Does learning languages with different writing directions increase ambidexterity? Is that checkable in your data?

Most of us learned to write as children [citation needed], and since most of your readership and most of the study's respondents are adults [citation needed] they may have already been exposed to this triumph of nurture over nature. It's possible that the honest responses of either population may not reflect the natural ambidexterity rate. Diving down the rabbit hole, if the populations were subject to these pressures, what does it actually tell us about the self-reported ambidextrous crowd? Are the ambidextrous kids from conservative Christian areas the kids who fell in line with an unreasonable authority figure? Who resisted falling in line? Surely that could be wrapped up in their feelings about authoritarianism.

These examples are presented as being similar, but they have an important distinction.

I agree that testing p-values for the Vitamin D example doesn't make too much sense. However, if you did want to perform this kind of broad-ranging testing, I think you should be concerned with the false discovery rate rather than the overall level of the tests. Each of these tests is, in some sense, a different hypothesis, and should receive its own budget of alpha.

The second example tests a single hypothesis in multiple ways. Because it's a single hypothesis, it could make sense to control the overall size of the test at 0.025. However, because these outcomes are (presumably) highly correlated, you should use a method that adjusts for the correlation structure. Splitting the alpha equally among four tests is unnecessarily conservative.