How to run GWAS using REGENIE
Pipeline for running GWAS (for binary or quantitative phenotype) using regenie.
As of DF7, all FinnGen core endpoint analyses switched to using REGENIE. REGENIE is essentially an improved alternative to SAIGE, with added computational efficiencies and better effect size estimates, and is thus recommended unless you specifically want to run analyses similar to those in FinnGen releases 1-6 (in which case, see the SAIGE instructions).
Note: The REGENIE pipeline can also be run using and initiated directly from the tool. In addition to additive model, also recessive and dominant analysis are available in . From Sandbox update 10.2 onwards also and phenotype analyses are available in.
!! NB !! Please be cautious with how many GWASs you create and how many phenotypes you include. If you are going to launch more than 5 GWASs, or a GWAS with tens of phenotypes, please contact us first so that we can temporarily increase the resources of your organization's Sandbox and downscale them afterward. After the resources have been increased, we recommend running a single GWAS job every 30 minutes (in a bash script you can use 'sleep 30m' in your loop). This gives two jobs per hour, or roughly 40 jobs in 24 hours. Spacing jobs this way avoids jamming the pipeline and lets other users in your organization keep using it.
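The 30-minute spacing described above can be sketched in Python as follows. This is only an illustration: `submit` is a hypothetical callable standing in for whatever command you use to launch a single GWAS job, and `submit_spaced` is a name invented for this sketch, not part of any FinnGen tool.

```python
import time


def submit_spaced(jobs, submit, interval_s=30 * 60, sleep=time.sleep):
    """Call submit(job) for each job, waiting interval_s seconds between calls.

    `submit` is a placeholder for your own submission routine (an assumption
    of this sketch). No wait is added before the first job or after the last.
    """
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            sleep(interval_s)  # 1800 s by default, i.e. 'sleep 30m'
        results.append(submit(job))
    return results
```

The `sleep` argument can be swapped for a no-op when dry-running the loop, so the submission order can be checked without actually waiting.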
If you plan to run a GWAS in DF12 (R12) with the standard FinnGen covariates and default REGENIE settings, we have created an "unmodifiable" version of the REGENIE pipeline that is easier to use. Its results are automatically added to PheWeb and made accessible in the green library, so you do not need to place a download request to access them outside the Sandbox environment. Please see usage instructions on .
If you plan to run your REGENIE GWAS in other data freezes, or to use non-default REGENIE settings and/or covariates other than the FinnGen standard ones, the unmodifiable pipeline is not suitable and you should continue following the instructions below.
If you plan to run a GWAS in DF12 but need non-default REGENIE settings, different covariates, or even different genotypes than the standard ones (e.g. HLA alleles), the modifiable pipeline is a good option. Because the pipeline is modifiable, the results are not automatically exported to the green library. To run it, navigate to the Modifiable workflows in the Pipelines tool, select Regenie DF12, and fill in the inputs according to the instructions below.
You can find the example files (.wdl files and .json files) for running REGENIE in the Sandbox in:
/finngen/library-green/scripts/regenie/
.json files (need to be edited!): regenie_example_R9.json, regenie_example_R10.json, regenie_example_R11.json, and regenie_example_R12.json
.wdl file: regenie.wdl *
sub-.wdl files as one zipped file: regenie_sub_wdl.zip *
These are examples to help you understand how to run REGENIE, using the endpoint J10_ASTHMA_EXMORE in DF9 (regenie_example_R9.json), DF10 (regenie_example_R10.json), DF11 (regenie_example_R11.json) and DF12 (regenie_example_R12.json).
*NOTE: All the files (.wdl and .json files) were updated in April 2023 (see the Users' meeting recording from April 2023), and .json files used before that may not work with the current .wdl.
*NOTE: The example below is listed as LIBRARY_RED, but you will not be able to write your custom files there. You will need to use the tag SANDBOX_RED and upload your files to /finngen/red using gsutil.
*NOTE: The phenotype file must be gzipped.
The parts you should edit in the .json file are highlighted in the figure, and are:
regenie.phenolist:
the path to a phenotype list file. A phenotype list file is a text file with each row representing a phenotypic trait (similar to SAIGE), for example:
(Note: Multiple correlated phenotypes with missing values of less than 5% can be grouped as a single row separated by a tab in the file. However, we still recommend running each phenotype separately.)
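A phenotype list file with one phenotype per row can be written with a few lines of Python. This is a minimal sketch: the helper name `write_phenolist` and the output file name are assumptions made for illustration, and the endpoint name is taken from the example files mentioned on this page.

```python
from pathlib import Path


def write_phenolist(path, phenotypes):
    """Write one phenotype name per row, as recommended above.

    Names that are empty or contain whitespace are rejected, because each
    row is expected to hold a single column name from the pheno-covariate
    file.
    """
    names = [name.strip() for name in phenotypes]
    bad = [name for name in names if not name or any(ch.isspace() for ch in name)]
    if bad:
        raise ValueError(f"invalid phenotype names: {bad!r}")
    Path(path).write_text("\n".join(names) + "\n")


if __name__ == "__main__":
    # Hypothetical output file name; the endpoint is the one used in the examples.
    write_phenolist("phenolist.txt", ["J10_ASTHMA_EXMORE"])
```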
regenie.cov_pheno:
the path to a phenotype-covariate file. The pheno-covariate file is a tab-separated (possibly gzipped) .txt file containing all phenotype and covariate columns. The first two columns of the file should be FID and IID. Please provide the same sample ID in both columns:
NB: Make sure that there are no spaces in the pheno-covariate file!
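Before uploading, the layout rules above (tab-separated, FID and IID first, the same sample ID in both, no spaces) can be checked with a short script. The following is a sketch under stated assumptions: `check_pheno_cov` is a hypothetical helper, not part of the pipeline, and it only tests the rules listed on this page.

```python
import csv
import gzip


def check_pheno_cov(path, required_columns=()):
    """Return a list of problems found in a tab-separated pheno-covariate file.

    Files ending in .gz are read through gzip; anything else as plain text.
    An empty list means none of the checked rules were violated.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    problems = []
    with opener(path, "rt", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        if header[:2] != ["FID", "IID"]:
            problems.append("first two columns must be FID and IID")
        for name in required_columns:
            if name not in header:
                problems.append(f"missing column: {name}")
        if any(" " in name for name in header):
            problems.append("header contains a space")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(
                    f"line {line_no}: expected {len(header)} fields, got {len(row)}"
                )
            elif len(row) >= 2 and row[0] != row[1]:
                problems.append(f"line {line_no}: FID and IID differ")
            if any(" " in field for field in row):
                problems.append(f"line {line_no}: field contains a space")
    return problems
```

Passing the phenotype and covariate names as `required_columns` also catches column names that are listed in the .json file but missing from the file itself.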
regenie.covariates:
List of covariate column names, separated by commas; for example: "age,gender". NOTE: The example .json file (regenie_example_R9.json) already defines the covariates used in the R9 core GWAS: age, sex, genotyping batch and PC1-10.
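The comma-separated value for regenie.covariates can be assembled from a list of column names, which helps avoid stray spaces or duplicates. This is an illustrative sketch: `covariate_string` is a hypothetical helper, and column names such as "PC1" are assumptions; use the names that actually appear in your pheno-covariate file.

```python
def covariate_string(names):
    """Join covariate column names into a comma-separated string."""
    cleaned = [name.strip() for name in names]
    for name in cleaned:
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid covariate name: {name!r}")
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("duplicate covariate names")
    return ",".join(cleaned)


if __name__ == "__main__":
    # Assumed column names for the ten principal components.
    pcs = [f"PC{i}" for i in range(1, 11)]
    print(covariate_string(["age", "sex"] + pcs))
```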
If you want to run a recessive or dominant model, you also need to edit:
Once your files are in order, you can submit your run by typing the following command in the FinnGen terminal:
Once your job has finished successfully, you can find your output files in: /finngen/pipeline/cromwell/workflows/regenie/[WORKFLOW_ID]/call-sub_step2/shard-#/sub.regenie_step2/[SUBWORKFLOW_ID]/call-gather/shard-#/
Bucket paths in the .json file need to follow the form proposed in buckets.txt when specifying the inputs (e.g. for the modified .json file).
Make sure that you are using the latest version of REGENIE in regenie.sub_step1.step1.docker and regenie.sub_step2.docker.
[pheno].gz is the preferred summary statistics format (not [pheno].regenie.gz) for downstream analyses, e.g. finemapping.
Related:
You may use some or all of the default covariates or add new covariates. If you would like to create a new covariate for the REGENIE run, please follow the instructions on .
Before you can submit your job, you need to download the needed files and edit the .json file, which looks like this:
regenie.is_binary:
true if your phenotype is binary (e.g. case-control), false if quantitative (e.g. BMI). Defines whether to run a logistic or linear model. See another example from .
regenie.sub_step2.step2.test:
defines the model (additive, recessive or dominant) used in the GWAS. In the example it is additive ("normal" GWAS). Unless you are specifically running a recessive or dominant model, there is no need to change this setting.
In REGENIE, you define whether to use a logistic or linear model by setting regenie.is_binary in the .json file: true for a logistic model (binary traits) and false for a linear model (continuous traits). If running a REGENIE model for a quantitative trait, you can also use .
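Editing the .json file by hand is easy to get wrong, so the same changes can be applied programmatically. The sketch below assumes that the example .json is a flat object keyed by the input names listed on this page and that regenie.is_binary is stored as a JSON boolean; `customise_inputs` is a hypothetical helper and all paths passed to it are placeholders.

```python
import json


def customise_inputs(template_path, out_path, phenolist, cov_pheno,
                     covariates, is_binary):
    """Copy an example inputs .json and overwrite the user-specific fields.

    All other keys in the template (docker images, genotype paths, ...) are
    left untouched.
    """
    with open(template_path) as fh:
        inputs = json.load(fh)
    inputs["regenie.phenolist"] = phenolist
    inputs["regenie.cov_pheno"] = cov_pheno
    inputs["regenie.covariates"] = ",".join(covariates)
    # True selects a logistic model (binary trait), False a linear model.
    inputs["regenie.is_binary"] = bool(is_binary)
    with open(out_path, "w") as fh:
        json.dump(inputs, fh, indent=2)
    return inputs
```

Working from a copy of the example file keeps the original untouched, which makes it easier to compare your inputs against the defaults later.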
If you're running REGENIE using Sandbox Pipelines, it's a good idea to first read the sections , and .
REMEMBER to save your job ID [WORKFLOW_ID] to keep track of your job and to be able to view the output! See also tips on . The [WORKFLOW_ID] and your job can be monitored from the pipelines:
Running Regenie (R7) in the Sandbox was presented in
Make sure that all files you have edited yourself, e.g. your edited phenotype-covariate file, are in /finngen/red/. Note that to copy files to /finngen/red/, you need to use gsutil and the gs://fg-production-sandbox-<NO>-red/ path for /finngen/red.
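The correspondence between /finngen/red and its bucket path can be expressed as a small helper, which is handy when filling in the .json inputs. This is a sketch only: `red_bucket_uri` is a hypothetical function, and `sandbox_no` stands for the <NO> part of the bucket name, which you need to supply for your own Sandbox.

```python
def red_bucket_uri(local_path, sandbox_no):
    """Translate a path under /finngen/red into the matching gs:// path."""
    prefix = "/finngen/red"
    if local_path != prefix and not local_path.startswith(prefix + "/"):
        raise ValueError(f"{local_path!r} is not under {prefix}")
    relative = local_path[len(prefix):].lstrip("/")
    return f"gs://fg-production-sandbox-{sandbox_no}-red/{relative}"
```

The returned string is the form to use both as the gsutil copy destination and as an input path in the .json file.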
See how the Sandbox paths and pipelines are mapped .