How to select controls for your cases

This topic is compiled from Mark Daly's FinnGen user meeting presentation on the 21st of September 2021. A link to the recording can be found here.

Control selection: (when) does it matter? Power considerations in GWAS design

Approximately 20 years ago, when genotyping was the major expense in genetic studies, we had to design efficient studies that minimized the number of people genotyped. Today, with hundreds of thousands of individuals genotyped in FinnGen, we have the option to select either a smaller, more restrictive set of controls or a larger, more inclusive one for a GWAS.

Two common & competing concerns on selecting controls

Two common and competing questions have recurred over the past 20 years of performing large-scale case-control association studies.

Concern 1: "If my controls have cases in them, doesn't that ruin the power or accuracy of my GWAS?"
- Answer: Be more selective about control choice
Concern 2: "If I do not select enough controls, aren't I leaving some power on the table?"
- Answer: Be more inclusive about controls

Tools within FinnGen Sandbox allow you to choose not only your case group but also your control group, very carefully.

Some intuition to start with

There are two intuitions that are useful to start from:

Intuition 1:

If there are 'cases' mixed in with the controls but the rate of those cases is very low, it will not impact the final answer much.

What happens if we mix our cases in with our controls and we don’t know they are cases? That is more or less the worst-case scenario.

You can think intuitively that if the rate of those cases is pretty low, it’s not going to have much of an effect on your study. That is because in a case-control study you are comparing the allele frequency in cases to the allele frequency in controls, and if you just sprinkle a few cases in with the controls, it really doesn’t change the control frequency that much.

However, if you sprinkle a lot of cases in with the controls, the frequency in your control set will differ considerably from what it should be, and you will have a big loss of power because of that.

Therefore (and elaborated further in image below),

- If 1% of controls are cases, it doesn’t change the frequency in controls that much.
- If 20% of controls are cases, it could change the frequency substantially.
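
The two bullet points above can be checked with a few lines of arithmetic. The sketch below assumes that the unrecognised cases hiding among the controls carry the same allele frequency as the recognised cases, so the observed control frequency is a simple weighted average. The frequencies (3% in true controls, 4% in cases) are hypothetical values chosen only for illustration.

```python
def observed_control_freq(p_control: float, p_case: float, contamination: float) -> float:
    """Allele frequency seen in a control set in which a fraction
    `contamination` of the individuals are actually unrecognised cases."""
    return (1 - contamination) * p_control + contamination * p_case


# Hypothetical allele frequencies, for illustration only.
p_control, p_case = 0.030, 0.040

for c in (0.01, 0.20):
    p_obs = observed_control_freq(p_control, p_case, c)
    # Share of the true case-control frequency difference that is still visible.
    retained = (p_case - p_obs) / (p_case - p_control)
    print(f"contamination={c:.0%}  control freq={p_obs:.4f}  difference retained={retained:.0%}")
```

Under this assumption the visible frequency difference shrinks in direct proportion to the contamination rate, which is why 1% is negligible and 20% is not.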

Intuition 2:

Keep in mind that at a certain point, adding more controls can’t help your power.

For example:

If you have 100 cases, it hardly matters whether you compare them to 10 000 or 100 000 controls, because the fluctuation in the allele frequency of 100 cases dominates the statistic that you’re going to calculate. Those two scenarios have practically the same power.

Therefore, beyond a certain point, adding more controls will not increase power meaningfully.
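
As a rough numerical check of this intuition, the sketch below computes the standard error of the case-control allele frequency difference, assuming a simple allelic comparison in which each person contributes two independent alleles. The 3% allele frequency is a hypothetical value; the function name is illustrative and not part of any FinnGen tool.

```python
import math


def se_freq_difference(p: float, n_cases: int, n_controls: int) -> float:
    """Standard error of (case frequency - control frequency) under the null,
    counting two alleles per person."""
    return math.sqrt(p * (1 - p) * (1 / (2 * n_cases) + 1 / (2 * n_controls)))


p = 0.03  # hypothetical allele frequency
se_10k = se_freq_difference(p, 100, 10_000)
se_100k = se_freq_difference(p, 100, 100_000)

print(f"SE with 10 000 controls:  {se_10k:.6f}")
print(f"SE with 100 000 controls: {se_100k:.6f}")
print(f"ratio: {se_10k / se_100k:.4f}")
```

The two standard errors differ by well under one percent, because the term contributed by the 100 cases dwarfs the term contributed by the controls in both scenarios.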

Why does the choice of controls actually matter?

When we perform an association test, we are essentially comparing the allele frequency in cases to the allele frequency in controls.

The test statistic from that comparison has a range of likely outcomes, and these are normally distributed.

For the distribution of test statistics, we will use a Z-score here. Descriptions in this subsection are centered around Figure 1 in the Power Primer.

Distribution on left

There is no true association between variant and phenotype.

A few percent of the time you will exceed a Z-score of two (or go below a Z-score of negative two), and it might appear that there’s a little bit of association, but that’s simply the normal variability that occurs by chance when you run a statistical test of this nature on data where there is no real association.

A significance threshold (a pre-defined threshold for rejection of the null hypothesis) is traditionally called alpha. That represents, therefore, the proportion of the time that your test statistic will exceed a threshold when there actually is no effect or the null hypothesis is true. This is most often referred to as a Type 1 error rate or just false positive rate if you prefer.
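
The relationship between a Z-score and alpha under the null can be sketched with Python's standard library. This assumes a two-sided test on a standard normal test statistic, as described above; the helper names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1: the null distribution


def two_sided_p(z: float) -> float:
    """Probability of a statistic at least as extreme as |z| when there is no association."""
    return 2 * std_normal.cdf(-abs(z))


def z_threshold(alpha: float) -> float:
    """|Z| that must be exceeded for a two-sided test at significance level alpha."""
    return -std_normal.inv_cdf(alpha / 2)


print(f"P(|Z| > 2) under the null: {two_sided_p(2):.4f}")
print(f"|Z| needed for alpha = 0.05: {z_threshold(0.05):.2f}")
print(f"|Z| needed for alpha = 5e-8: {z_threshold(5e-8):.2f}")
```

This reproduces the "few percent of the time" figure for a Z-score of two and shows how much further into the tail the genome-wide threshold sits.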

Distribution on the right

There is actually a genuine association. This is the distribution of the test statistic under the alternative hypothesis that a specific variant has a certain effect size, for a given sample size.

That distribution governs how much power you have. When we perform a power calculation in genetics, we are using these models; the test statistics and their distributions (which are generally normal). We calculate what are the possible outcomes based on the effect size that you would get under the alternative hypothesis. Power is calculated directly from these closed-form equations rather than doing millions and millions of simulations.

Beta, or the Type 2 error rate, is "1 minus the power". So the power is the chance that your variant association test exceeds the alpha threshold under the specified model. In this case the association has pretty good power to exceed a p-value of 0.05. However, p = 0.05, as we know, is not very strict. If we evaluate the same model at an alpha of $5 \times 10^{-8}$, the conventional genome-wide significance threshold, we see that it has only a tiny power to exceed $5 \times 10^{-8}$ (Figure 2). On the good side, there is really no chance whatsoever that alpha will be exceeded by chance. However, there is also very little chance that we exceed it when the association is real.

The pink space in Figure 2 is where we spend a lot of our time in genetic studies: we don’t quite have conclusive power to exceed $5 \times 10^{-8}$, but there is potentially a real signal that we are trying to make sense of.

When we increase the effect size, we move the alternative hypothesis distribution to the right (Figure 5). The expected Z-score goes from 3 to 5, and that gives us a very reasonable chance to exceed even a p-value of $5 \times 10^{-8}$.
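
The closed-form nature of this calculation can be illustrated with a minimal sketch. It assumes that the test statistic under the alternative is normal with unit variance and a mean equal to the expected Z-score, and that the test is two-sided. The function name is illustrative; this is not the Genetic Power Calculator's own code.

```python
from statistics import NormalDist

std_normal = NormalDist()


def power_two_sided(expected_z: float, alpha: float) -> float:
    """Chance that |Z| exceeds the alpha threshold when Z ~ Normal(expected_z, 1)."""
    z_crit = -std_normal.inv_cdf(alpha / 2)
    upper_tail = std_normal.cdf(expected_z - z_crit)   # Z lands above +z_crit
    lower_tail = std_normal.cdf(-expected_z - z_crit)  # Z lands below -z_crit
    return upper_tail + lower_tail


for expected_z in (3, 5):
    for alpha in (0.05, 5e-8):
        print(f"expected Z={expected_z}  alpha={alpha:g}  power={power_two_sided(expected_z, alpha):.3f}")
```

With an expected Z-score of 3 the power is high at alpha = 0.05 but below 1% at $5 \times 10^{-8}$; moving the expected Z-score to 5 raises the genome-wide power to roughly one in three.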

Increasing the sample size has the same impact on the test statistics as increasing the effect size (Figure 6). This likewise moves that expected test statistic distribution further over.

We spend a lot of time speaking about expected chi-square ( $\chi^2$ ) or the $\chi^2$ noncentrality parameter, which is basically just the expected value of the chi-square.

First of all, what is a $\chi^2$? It is simply a sum of squared Z-scores, that is, a sum of squared standard normal deviates. The traditional 1 degree of freedom chi-square is simply the distribution of $Z^2$.

This sometimes strikes people as a little bit strange or antiquated, because we have spent the last 20 years telling people not to use the $\chi^2$ test for association. While it’s true we don’t use it for association, it has a number of fantastic properties that make it very useful in the context of power calculation (Figure 7).

One useful property of the $\chi^2$ is that it scales linearly with sample size - unlike power, which is some complicated function of whatever threshold you happen to select as being interested in. When you double the sample size of the study, you double the $\chi^2$ for any truly associated variant.

It also has another really cool property: the reason we use $r^2$ in preference to other measures of linkage disequilibrium between sites is that the expected chi-square of a SNP in LD with a causal variant is simply the product of the $r^2$ between the two sites and the chi-square at the causal SNP. This provides easy opportunities for exploring the behavior of LD and association statistics together.
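
Both properties can be seen in a small sketch. It uses a common approximation of the noncentrality parameter for a simple allelic test (squared frequency difference divided by its null variance). The allele frequencies, sample sizes and $r^2$ value are hypothetical, and the function names are illustrative.

```python
def allelic_ncp(p_case: float, p_control: float, n_cases: int, n_controls: int) -> float:
    """Approximate chi-square noncentrality parameter for an allelic case-control test."""
    # Pooled allele frequency across the whole sample.
    p_bar = (n_cases * p_case + n_controls * p_control) / (n_cases + n_controls)
    # Null variance of the frequency difference, two alleles per person.
    variance = p_bar * (1 - p_bar) * (1 / (2 * n_cases) + 1 / (2 * n_controls))
    return (p_case - p_control) ** 2 / variance


def tag_snp_ncp(causal_ncp: float, r2: float) -> float:
    """Noncentrality parameter at a SNP in LD (r2) with the causal variant."""
    return r2 * causal_ncp


# Hypothetical scenario, for illustration only.
ncp = allelic_ncp(0.025, 0.030, 10_000, 100_000)
ncp_doubled = allelic_ncp(0.025, 0.030, 20_000, 200_000)

print(f"NCP: {ncp:.2f}")
print(f"NCP with doubled sample size: {ncp_doubled:.2f}")
print(f"NCP at a tag SNP with r2 = 0.8: {tag_snp_ncp(ncp, 0.8):.2f}")
```

Doubling both cases and controls doubles the noncentrality parameter, and a tag SNP inherits the causal SNP's value scaled by $r^2$.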

Example cases:

A coding variant (R20Q) in the gene SPDL1 protects against all kinds of cancer and confers quite a strong risk of idiopathic pulmonary fibrosis (Figures 8 and 9). This is one of the very interesting findings from the first years of FinnGen.

This example was selected because it gives a very nice way of exploring the actual practical effect of contamination of controls.

The R20Q variant has an allele at 3% frequency that is protective against all forms of cancer (Figure 9).

The Genetic Power Calculator is a great tool for calculating power (Figure 10). It was developed by Shaun Purcell more than 15 years ago. It allows you to perform these closed-form power calculations for a variety of experimental designs and settings. All you need to do is enter your variant frequency, effect size, sample size, case-control ratio, and the alpha and beta thresholds of interest.

First modeling exercise: we plug in a model with the prevalence of all cancer and compare what we might expect to see in a prostate cancer study with 10K cases and either 100K non-cancer controls or 125K all-male controls, where the additional 25K are males with a non-prostate cancer who are included in the analysis as controls (Figure 11).

What you see here is that these two approaches to studying prostate cancer result in quite different significance and power to exceed $5 \times 10^{-8}$ (Figure 11). Almost ¾ orders of magnitude difference in p-value at this relatively meaningful level, where we actually still care about the p-values because they are neither so large nor so small that we are not interested in the exact number.

That emerges from the fact that when only cancer-free individuals are selected as controls and cancer prevalence is 20%, the control set doesn’t have the population allele frequency of 3%. It actually has a higher frequency, because these are the individuals who apparently had some protection from cancer.
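
The shift in control allele frequency can be sketched with a simplified allele-level model, in which each allele is treated as carrying its own disease risk. The 3% allele frequency and 20% prevalence come from the text above; the allelic relative risk of 0.8 is a hypothetical value chosen for illustration and is not the estimated SPDL1 effect.

```python
def allele_freq_by_status(p: float, prevalence: float, allelic_rr: float) -> tuple[float, float]:
    """Return (frequency in cases, frequency in unaffected) for an allele with
    population frequency p and relative risk allelic_rr versus the other allele."""
    # Risk attached to the reference allele, set so the population prevalence is matched.
    risk_ref = prevalence / (p * allelic_rr + (1 - p))
    risk_alt = allelic_rr * risk_ref
    freq_cases = p * risk_alt / prevalence
    freq_unaffected = p * (1 - risk_alt) / (1 - prevalence)
    return freq_cases, freq_unaffected


# allelic_rr = 0.8 is an assumed, illustrative protective effect.
freq_cases, freq_unaffected = allele_freq_by_status(p=0.03, prevalence=0.20, allelic_rr=0.8)

print(f"frequency in cancer cases:         {freq_cases:.4f}")
print(f"frequency in cancer-free controls: {freq_unaffected:.4f}")
print("population frequency:              0.0300")
```

Because the protective allele is depleted in cases, it is enriched above the population frequency among cancer-free individuals, which widens the case-control difference when those individuals are used as controls.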

So looking at this you would naturally say, in the case of cancer we certainly want to choose the option that gives us better power, so we would choose the smaller number of non-cancer controls.

However, someone might then point out: what if this were a prostate-cancer-only association? We would be leaving out 25K controls and therefore hurting our power for the discovery of other loci.

In this case, because the ratio is already 10 controls for every case, it makes essentially no difference, as you can see from the power and expected chi-square: adding 25K more pure non-cancer controls to the 100K changes the association statistics negligibly (Figure 12).

This points us back to the intuition about how much oversampling of controls we actually need in the context of a conventional GWAS.

From the graph below, you can see that once you have 5-10 times as many controls as cases, there is very little advantage in oversampling more controls.

That’s a good rule of thumb to keep in mind. There is no reason to be concerned if you already have 10-15 times more controls than cases: you are not going to gain any advantage in GWAS by pulling in even more controls. You need more cases if you want to boost power.
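
The rule of thumb follows from how the expected chi-square depends on the two sample sizes. As a sketch, assume the expected chi-square is proportional to $1 / (1/N_{cases} + 1/N_{controls})$ and that the allele frequency variance term stays roughly constant. With $R$ controls per case, the fraction of the maximum achievable value (the value with unlimited controls) is then $R / (1 + R)$.

```python
def fraction_of_max_ncp(controls_per_case: float) -> float:
    """Expected chi-square relative to what unlimited controls would give,
    for a fixed number of cases."""
    return controls_per_case / (1 + controls_per_case)


for ratio in (1, 2, 5, 10, 12.5, 15, 100):
    print(f"{ratio:>5} controls per case -> {fraction_of_max_ncp(ratio):.1%} of the maximum")
```

Going from 10 to 12.5 controls per case, as in the 100K versus 125K comparison above, moves this fraction by less than two percentage points, whereas doubling the number of cases would double the expected chi-square.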

One percent contamination of the controls really doesn't make very much difference to the strength of these associations (Figure 14).

Summary

Control contamination is relevant to consider when the genetic variants may be shared with a common disease. Otherwise, there is really no reason to be concerned about your controls containing ~1% of cases.

Additionally, increasing the control-to-case ratio beyond 10:1 really does not add much to power.

Finally, if you are concerned about power, we suggest using a power calculator to explore specific scenarios yourself:

Web-based tool: http://zzz.bwh.harvard.edu/gpc/
Windows-based tool: http://csg.sph.umich.edu/abecasis/cats/
There are many others out there, along with a plethora of materials and tools in the field of power calculation in genetics.
