FinnGen Handbook

How to select controls for your cases


Last updated 1 year ago


This topic is compiled from Mark Daly's FinnGen user meeting presentation on 21 September 2021. A link to the recording can be found here.

Control selection: (when) does it matter? Power considerations in GWAS design

Approximately 20 years ago, when genotyping was the major expense in genetic studies, we had to design efficient studies that minimized the number of people genotyped. Today, with hundreds of thousands of individuals genotyped in FinnGen, we can choose between a more restrictive control set and a larger, more inclusive one for a GWAS.

Two common & competing concerns on selecting controls

Two common and competing questions have recurred over the past 20 years of large-scale case-control association studies.

  • Concern 1: "If my controls have cases in them, doesn't that ruin the power or accuracy of my GWAS?"

    • Answer: Be more selective about control choice

  • Concern 2: "If I do not select enough controls, aren't I leaving some power on the table?"

    • Answer: Be more inclusive about controls

Tools within FinnGen Sandbox allow you to choose not only your case group but also your control group, very carefully.

Some intuition to start with

Two intuitions are a useful starting point:

Intuition 1:

If 'cases' are mixed in with the controls but their rate is very low, the final answer will not change much.

What happens if we mix cases in with our controls and we don’t know they are cases? That is the worst-case scenario.

Intuitively, if the rate of those cases is low, it will not have much of an effect on your study. In a case-control study you compare the allele frequency in cases to the allele frequency in controls, and sprinkling a few cases in with the controls barely changes the control frequency.

However, if you sprinkle a lot of cases in with the controls, the frequency in your control set shifts substantially away from what it should be, and you lose a lot of power as a result.

Therefore (as elaborated further in the image below):

  • if 1% of controls are cases, the allele frequency in controls barely changes.

  • if 20% of controls are cases, the allele frequency in controls can change substantially.
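The dilution described above can be sketched in a few lines. This is only an illustration: the allele frequencies (30% in true controls, 35% in cases) are hypothetical numbers chosen for the example, not values from the presentation, and the contaminated control set is treated as a simple mixture of true controls and unrecognised cases.

```python
def contaminated_control_freq(p_control, p_case, contamination):
    """Allele frequency in a control set in which a fraction
    `contamination` of the individuals are unrecognised cases."""
    return (1 - contamination) * p_control + contamination * p_case


# Hypothetical allele frequencies, chosen only for illustration.
p_control, p_case = 0.30, 0.35

for contamination in (0.0, 0.01, 0.20):
    observed = contaminated_control_freq(p_control, p_case, contamination)
    # Share of the true case-control frequency difference that survives.
    retained = (p_case - observed) / (p_case - p_control)
    print(f"{contamination:>4.0%} contamination: control freq {observed:.4f}, "
          f"{retained:.0%} of the true difference retained")
```

Under this mixture assumption the observed case-control difference shrinks by exactly the contamination fraction, so 1% contamination leaves 99% of the difference and 20% contamination leaves 80%.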

Intuition 2:

Keep in mind that at a certain point, adding more controls can’t help your power.

For example:

  • if you have 100 cases, it hardly matters whether you compare them to 10 000 or 100 000 controls, because the fluctuation in the allele frequency of 100 cases dominates the statistic that you are going to calculate. Those two scenarios have essentially the same power.

Therefore, beyond a certain point, adding more controls will not increase power meaningfully.
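A rough way to see this, assuming a simple comparison of allele frequencies between $N_{\text{case}}$ cases and $N_{\text{ctrl}}$ controls with two alleles per person (the symbols and the shared variance term $\bar{p}(1-\bar{p})$ are introduced here for illustration only):

$$
\operatorname{Var}\left(\hat{p}_{\text{case}} - \hat{p}_{\text{ctrl}}\right) \approx \bar{p}(1-\bar{p})\left(\frac{1}{2N_{\text{case}}} + \frac{1}{2N_{\text{ctrl}}}\right)
$$

With 100 cases, $\frac{1}{100} + \frac{1}{10\,000} = 0.0101$ and $\frac{1}{100} + \frac{1}{100\,000} = 0.01001$. The two variances differ by less than 1%, so the case term dominates in both scenarios.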

Why does the choice of controls actually matter?

When we perform an association test, we are essentially comparing the allele frequency in cases to the allele frequency in controls.

The resulting test statistic has a range of likely outcomes that is approximately normally distributed.

We will use a Z-score to describe the distribution of test statistics. Descriptions in this subsection are centered around Figure 1 (Power primer).

Distribution on left

There is no true association between variant and phenotype.

A few percent of the time you will exceed a Z-score of two (or you will go below a Z score of negative two), and it might appear that there’s a little bit of association, but that’s simply the normal variability that occurs by chance when you run a statistical test of this nature on data where there is no real association.

A significance threshold (a pre-defined threshold for rejection of the null hypothesis) is traditionally called alpha. That represents, therefore, the proportion of the time that your test statistic will exceed a threshold when there actually is no effect or the null hypothesis is true. This is most often referred to as a Type 1 error rate or just false positive rate if you prefer.
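The "few percent of the time" figure for the null distribution can be checked directly from the standard normal distribution. The sketch below assumes a two-sided test and uses only the Python standard library. The value 5.45 is included because it is approximately the Z-score that corresponds to a two-sided p-value of 5 x 10^-8.

```python
from math import erfc, sqrt


def two_sided_tail(z):
    """Probability that a standard normal Z-score is more extreme
    than +/- z when there is no true association."""
    return erfc(abs(z) / sqrt(2))


for z in (2.0, 3.0, 5.45):
    print(f"P(|Z| > {z}) = {two_sided_tail(z):.2e}")
```

For a Z-score of two this gives roughly 4.6%, which is the chance variability described above.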

Distribution on the right

There is actually a genuine association. This is the distribution of the test statistic under the alternative hypothesis, for a variant with a given effect size in a study of a given sample size.

That distribution governs how much power you have. When we perform a power calculation in genetics, we use these models: the test statistics and their distributions (which are generally normal). We calculate the possible outcomes under the alternative hypothesis for the specified effect size. Power is calculated directly from these closed-form equations rather than from millions and millions of simulations.

Beta, the Type 2 error rate, is "1 minus the power". Power is therefore the chance that your variant association test exceeds the alpha threshold under the specified model. In this case the association has pretty good power to exceed a p-value of 0.05. However, p = 0.05, as we know, is not very strict. If we evaluate the same model at an alpha of $5 \times 10^{-8}$, a conventional genome-wide significance threshold, we see that it has very little power to exceed that threshold (Figure 2). On the good side, there is essentially no chance that the threshold will be exceeded by chance alone. However, there is also very little chance that we exceed it when the association is real.

The pink space in Figure 2 is where we spend a lot of our time in genetic studies: we do not quite have conclusive power to exceed $5 \times 10^{-8}$, but there is potentially a real signal that we are trying to make sense of.

When we increase the effect size, we move the alternative hypothesis distribution to the right (Figure 5). The expected Z-score goes from 3 to 5, and that would give us a very reasonable chance to exceed even a p-value of $5 \times 10^{-8}$.
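The closed-form power calculation described above can be sketched as follows. The sketch assumes a two-sided Z-test whose statistic is normally distributed with unit variance around the expected Z-score under the alternative. The function name `power` is illustrative and not taken from any FinnGen tool.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def power(expected_z, alpha):
    """Two-sided power of a Z-test whose statistic is N(expected_z, 1)
    under the alternative hypothesis."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # significance threshold on the Z scale
    upper = STD_NORMAL.cdf(expected_z - z_crit)   # chance of exceeding +z_crit
    lower = STD_NORMAL.cdf(-expected_z - z_crit)  # chance of falling below -z_crit
    return upper + lower


for expected_z in (3, 5):
    for alpha in (0.05, 5e-8):
        print(f"expected Z = {expected_z}, alpha = {alpha:g}: "
              f"power = {power(expected_z, alpha):.3f}")
```

Under these assumptions an expected Z-score of 3 gives roughly 85% power at alpha = 0.05 but under 1% at 5 x 10^-8, while an expected Z-score of 5 gives about one chance in three of reaching genome-wide significance.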

Increasing the sample size has the same impact on the test statistics as increasing the effect size (Figure 6). This likewise moves that expected test statistic distribution further over.

We spend a lot of time speaking about the expected chi-square ($\chi^2$), or the $\chi^2$ noncentrality parameter, which is closely related to the expected value of the chi-square (for a 1-degree-of-freedom test, the expected value is the noncentrality parameter plus 1).

First of all, what is a $\chi^2$? It is simply a sum of squared Z-scores, that is, of squared standard normal deviates. The traditional 1 degree of freedom chi-square is simply the distribution of $Z^2$.

This sometimes strikes people as a little bit strange or antiquated, because we have spent the last 20 years telling people not to use the $\chi^2$ test for association. While it’s true we don’t use it for association, it has a number of fantastic properties that make it very useful in the context of power calculation (Figure 7).

One useful property of the $\chi^2$ is that its expected value for a truly associated variant scales linearly with sample size, unlike power, which is a complicated function of whatever threshold you happen to be interested in. When you double the sample size of the study, you essentially double the expected $\chi^2$ for any truly associated variant.

It also has another really cool property. The reason we use $r^2$ in preference to other measures of linkage disequilibrium between sites is that the expected chi-square of a SNP in LD with a causal variant is simply the product of the $r^2$ between the two sites and the chi-square at the causal SNP. This provides easy opportunities for exploring the behavior of LD and association statistics together.
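The two scaling properties can be combined in a short sketch. The numbers (a noncentrality of 30 at the causal variant, 100 000 samples, r² of 0.5) are hypothetical, and the function names are illustrative. The sketch works with the noncentrality parameter, which is the part of the expected chi-square that reflects true signal.

```python
def scaled_ncp(ncp, n_old, n_new):
    """Noncentrality after changing the sample size, assuming it scales
    linearly with N at a fixed case-control ratio."""
    return ncp * n_new / n_old


def tag_ncp(causal_ncp, r2):
    """Noncentrality at a SNP in LD (r2) with the causal variant."""
    return r2 * causal_ncp


# Hypothetical values, chosen only for illustration.
causal_ncp, n, r2 = 30.0, 100_000, 0.5

print("causal SNP, doubled sample size:", scaled_ncp(causal_ncp, n, 2 * n))

tag = tag_ncp(causal_ncp, r2)
print("tag SNP at the original sample size:", tag)

# To recover the causal-SNP signal at the tag SNP, N must grow by 1 / r2.
n_needed = n / r2
print("tag SNP with N / r2 samples:", scaled_ncp(tag, n, n_needed))
```

One consequence of these two properties is that a tag SNP needs 1 / r² times the sample size to reach the signal strength of the causal variant.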

Example cases:

A coding variant (R20Q) in the gene SPDL1 protects from all kinds of cancer and confers quite a strong risk of idiopathic pulmonary fibrosis (Figures 8 and 9). This is one of the very interesting findings from the first years of FinnGen.

This example was selected because it gives a very nice way of exploring the practical effect of contamination of controls.

The R20Q allele has a frequency of about 3% and is protective against all forms of cancer (Figure 9).

First modeling exercise: we plug in a model with the prevalence of all cancer and compare what we might expect to see in a prostate cancer study with 10K cases under two control choices. The first uses 100K non-cancer controls. The second uses 125K all-male controls, where the extra 25K are males with a cancer other than prostate cancer (Figure 11).

What you see here is that these two approaches to studying prostate cancer result in quite different significance and power to exceed $5 \times 10^{-8}$ (Figure 11). There is a difference of roughly 3-4 orders of magnitude in p-value at this relatively meaningful level, where we still care about the p-values because they are neither so large nor so small that the exact number stops mattering.

That emerges from the fact that when cancer-free individuals are selected as controls and cancer prevalence is 20%, the control set does not have the population frequency of 3%. It actually has a higher frequency, because these are the individuals who apparently had some protection from cancer.

So looking at this you would naturally say, in the case of cancer we certainly want to choose the option that gives us better power, so we would choose the smaller number of non-cancer controls.

However, someone might then point out: what if this were a prostate-cancer-only association? We would then be leaving out 25K controls and therefore hurting our power for the discovery of other loci.

In this case, because the ratio is already 10 controls for every case, it makes essentially no difference, as you can see from the power and expected chi-square. Adding 25K more pure non-cancer controls to the 100K changes the association statistics negligibly (Figure 12).
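The three control choices discussed above can be compared with a rough model. This is a sketch under stated assumptions, not the calculation behind the figures: the per-allele relative risk of 0.8 is a hypothetical value, risk is assumed multiplicative per allele, prostate cancer cases are assumed to carry the same allele frequency as cancer cases in general, and the expected Z-score comes from a simple allelic test. All function names are illustrative.

```python
from math import erfc, sqrt


def case_and_noncase_freq(pop_freq, prevalence, rr_per_allele):
    """Allele frequency in cases and in non-cases under a multiplicative
    per-allele risk model with Hardy-Weinberg proportions."""
    p_case = pop_freq * rr_per_allele / (pop_freq * rr_per_allele + 1 - pop_freq)
    # The population frequency is the prevalence-weighted mix of both groups.
    p_noncase = (pop_freq - prevalence * p_case) / (1 - prevalence)
    return p_case, p_noncase


def expected_z(p_case, n_case, p_ctrl, n_ctrl):
    """Approximate expected Z-score of an allelic case-control test."""
    pooled = (n_case * p_case + n_ctrl * p_ctrl) / (n_case + n_ctrl)
    se = sqrt(pooled * (1 - pooled) * (1 / (2 * n_case) + 1 / (2 * n_ctrl)))
    return (p_ctrl - p_case) / se


def two_sided_p(z):
    return erfc(abs(z) / sqrt(2))


POP_FREQ, PREVALENCE = 0.03, 0.20   # values quoted in the text
RR = 0.8                            # hypothetical protective effect per allele
N_CASE = 10_000

p_case, p_noncancer = case_and_noncase_freq(POP_FREQ, PREVALENCE, RR)
# 125K all-male controls: 100K cancer-free plus 25K with another cancer.
p_mixed = (100_000 * p_noncancer + 25_000 * p_case) / 125_000

z_restricted = expected_z(p_case, N_CASE, p_noncancer, 100_000)
z_mixed = expected_z(p_case, N_CASE, p_mixed, 125_000)
z_pure = expected_z(p_case, N_CASE, p_noncancer, 125_000)

for label, z in [("100K non-cancer controls", z_restricted),
                 ("125K all-male controls", z_mixed),
                 ("125K pure non-cancer controls", z_pure)]:
    print(f"{label}: expected Z = {z:.2f}, p = {two_sided_p(z):.1e}")
```

With 25K of the 125K controls being cancer cases, the mixed control set sits back at the population frequency of 3%, which weakens the expected signal. Going from 100K to 125K pure controls barely changes it.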

This brings us back to the intuition about how much oversampling of controls we actually need in the context of a conventional GWAS.

From the graph below, you can see that once you have 5-10 times as many controls as cases, there is very little advantage in oversampling controls any further.

That’s a good rule of thumb to keep in mind. If you already have 10-15 times more controls than cases, there is no reason to be concerned: you will not gain any real advantage in GWAS by pulling in even more controls. You need more cases if you want to boost power.
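The rule of thumb can be illustrated with a one-line formula. Assuming the expected chi-square is proportional to 1 / (1/N_case + 1/N_ctrl), as in a simple comparison of allele frequencies, a study with k controls per case reaches a fraction k / (k + 1) of the signal it would have with unlimited controls. The function name is illustrative.

```python
def fraction_of_max_signal(controls_per_case):
    """Share of the unlimited-controls expected chi-square reached with
    k controls per case, assuming signal is proportional to
    1 / (1/N_case + 1/N_ctrl)."""
    k = controls_per_case
    return k / (k + 1)


for k in (1, 2, 5, 10, 15, 100):
    print(f"{k:>3} controls per case: {fraction_of_max_signal(k):.1%} of the maximum")
```

Under this assumption, 5 controls per case already give about 83% of the maximum and 10 give about 91%, so further controls add little. Doubling the number of cases doubles the signal at any fixed ratio.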

One percent contamination of the controls really doesn't make much difference to the strength of these associations (Figure 14).

Summary

Control contamination is relevant to consider when the genetic variants may be shared with a common disease. By contrast, there is really no reason to be concerned about your controls containing ~1% cases.

Additionally, increasing the control-to-case ratio beyond 10:1 really does not add much to power.

Finally, if you are concerned about statistical power, we suggest using a power calculator to explore specific scenarios yourself:

  • http://zzz.bwh.harvard.edu/gpc/ (web-based tool)

  • http://csg.sph.umich.edu/abecasis/cats/ (Windows-based tool)

  • And many others out there. There is a plethora of materials and tools in the field of power calculation in genetics!

Genetic power calculator is a great tool for calculating power (Figure 10). It was developed by Shaun Purcell more than 15 years ago. It allows you to perform these closed-form power calculations in a variety of different experimental designs and settings. All you need to do is enter your variant frequency, effect size, sample size, case-control ratio, and the alpha and beta thresholds of interest.
Figure 1: Power primer
Figure 4: Stricter significance threshold
Figure 5: What happens if we increase effect size.
Figure 6: Increasing sample size.
Figure 7: Chi square
Figure 8: SPDL1:R20Q - protection from all cancers
Figure 9: Model based on SPDL1:R20Q
Figure 10: Genetic power calculator
Figure 11: Exploring power in the context of prostate cancer (allele frequency = 0.03, cancer prevalence 20%)
Figure 12: What if the extra 25K controls were actually pure controls (that is, this was prostate only)?
Figure 13: Association strength versus control oversampling
Figure 14: What if contamination of controls was only 1%?
Figure 15: Summary