How to run GWAS using REGENIE

What?

A pipeline for running GWAS on binary or quantitative phenotypes using REGENIE.

Introduction

As of DF7, all FinnGen core endpoint analyses switched to using REGENIE. REGENIE is essentially an improved alternative to SAIGE, with added computational efficiencies and better effect size estimates. It is therefore the recommended tool unless you specifically want to run analyses similar to those in FinnGen releases 1-6 (in which case, see How to run GWAS using SAIGE).

Note: The REGENIE pipeline can also be run using the Custom GWAS tools and initiated directly from the Cohort Operations tool. In addition to the additive model, recessive and dominant analyses are available in the Custom GWAS CLI. From Sandbox update 10.2 onwards, both binary and quantitative phenotype analyses are available in the Custom GWAS CLI.

!! NB !! Please be cautious with the number of GWAS jobs you create and the number of phenotypes you include. If you are going to launch more than 5 GWAS jobs, or a GWAS with tens of phenotypes, please contact [email protected] so that we can temporarily increase the resources of your organization's Sandbox and scale them down afterwards. Once the resources have been increased, we recommend submitting a single GWAS job every 30 minutes (in a bash script you can use ‘sleep 30m’ in your loop), so that you run two phenotypes per hour, which allows ~40 jobs in 24 hours. This helps avoid jamming the pipeline and lets other users in your organization use it as well.
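The pacing recommended above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the list of .json files and the injectable run/sleep parameters are hypothetical, and the finngen-cli call uses only the rw, -w, -i and -d options shown in the "Using command line" section of this page.

```python
import subprocess
import time


def build_submit_command(wdl, inputs_json, deps_zip):
    """Build the finngen-cli call from the 'Using command line' section."""
    return ["finngen-cli", "rw", "-w", wdl, "-i", inputs_json, "-d", deps_zip]


def submit_spaced(json_files, wdl, deps_zip, pause_seconds=30 * 60,
                  run=subprocess.run, sleep=time.sleep):
    """Submit one job per .json file, waiting pause_seconds between jobs.

    run and sleep are parameters (hypothetical design choice) so that the
    loop can be dry-run without touching the real pipeline.
    """
    commands = []
    for index, inputs_json in enumerate(json_files):
        if index > 0:
            sleep(pause_seconds)  # no wait before the first submission
        command = build_submit_command(wdl, inputs_json, deps_zip)
        run(command, check=True)  # raises if the submission command fails
        commands.append(command)
    return commands
```

For a dry run, pass run=lambda command, check: print(command) and sleep=lambda seconds: None. Remember to record the job ID of each submission.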

Quick (and easy) method for DF12: Unmodifiable REGENIE pipeline

If you plan to run a GWAS in DF12 (R12) with the standard FinnGen covariates and default REGENIE settings, you can use the "unmodifiable" version of the REGENIE pipeline. It is easier to use, and your results are automatically added to PheWeb and made accessible in the green library, so you do not need to place a download request to access them outside of the Sandbox environment. Please see usage instructions on this page.

If you plan to run your REGENIE GWAS in another data freeze, or to use non-default REGENIE settings and/or covariates other than the FinnGen standard ones, the unmodifiable pipeline is not suitable and you should continue following the instructions below.

Quick method for DF12: Modifiable REGENIE pipeline

If you plan to run a GWAS in DF12 but need non-default REGENIE settings, different covariates, or even different genotypes than the standard ones (e.g. HLA alleles), use the modifiable pipeline. Note that because the pipeline is modifiable, the results are not automatically exported to the green library. To run the modifiable pipeline, navigate to the Modifiable workflows in the Pipelines tool, select Regenie DF12, and fill in the inputs according to the instructions below.

Example files for running the REGENIE pipeline

You can find the example files (.wdl and .json) for running REGENIE in the Sandbox in:

/finngen/library-green/scripts/regenie/:

.json files (need to be edited!):
- regenie_example_R9.json
- regenie_example_R10.json
- regenie_example_R11.json
- regenie_example_R12.json
.wdl file: regenie.wdl*
sub-.wdl files as one zipped file: regenie_sub_wdl.zip*

These are examples to help you understand how to run REGENIE, using the endpoint J10_ASTHMA_EXMORE in DF9 (regenie_example_R9.json), in DF10 (regenie_example_R10.json), DF11 (regenie_example_R11.json) and DF12 (regenie_example_R12.json).

*NOTE: All the files (.wdl and .json) were updated in April 2023 (see User's meeting recording from April 2023), and .json files created before that may not work with the current .wdl.

*NOTE: The example below is listed as LIBRARY_RED, but you will not be able to write your custom files there. You will need to use the tag SANDBOX_RED and upload your files to /finngen/red using gsutil.

*NOTE: The phenotype file must be gzipped.

Covariate + phenotype file

You may use some or all of the default covariates, or add new ones. If you would like to add a covariate to the REGENIE run, please follow the instructions on how to make a covariate + phenotype file for the GWAS pipeline.

Prepare your files for REGENIE

Before you can submit your job, you need to copy the required example files and edit the .json file.

The parts of the .json file that you should edit (highlighted in the figure) are:

regenie.phenolist: the path to a phenotype list file. A phenotype list file is a text file with each row representing a phenotypic trait (similar to SAIGE), for example:

I9_CHD
T1D_WIDE

(Note: Multiple correlated phenotypes with less than 5% missing values can be grouped on a single row, separated by tabs. However, we still recommend running each phenotype separately.)
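As a minimal sketch of preparing such a file, the Python snippet below writes one phenotype per row, following the recommendation to run each phenotype separately. The function name and the output file name are hypothetical; the phenotype names are the ones used in the example above.

```python
from pathlib import Path


def write_phenolist(path, phenotypes):
    """Write a phenotype list file with exactly one phenotype per row."""
    for name in phenotypes:
        # A tab would group several phenotypes on one row, so reject any
        # whitespace to keep one phenotype per row.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid phenotype name: {name!r}")
    Path(path).write_text("\n".join(phenotypes) + "\n")
    return path
```

For example, write_phenolist("my_phenolist.txt", ["I9_CHD", "T1D_WIDE"]) produces the two-row file shown above.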

regenie.cov_pheno: the path to a phenotype-covariate file. The pheno-covariate file is a tab-separated .txt file (gzipped, as noted above) containing all phenotype and covariate columns. The first two columns of the file should be FID and IID. Please provide the same sample ID in both columns:

FID    IID
FGID1    FGID1
FGID2    FGID2
FGID3    FGID3

NB: Make sure that there are no spaces in the pheno-covariate file!
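The requirements above can be checked before submitting a job. The following Python sketch is an illustration only, not part of the pipeline: the function name and the required_columns parameter are hypothetical, and it assumes the file is gzipped and tab-separated with a header row.

```python
import gzip


def check_cov_pheno(path, required_columns=()):
    """Return a list of problems found in a gzipped pheno-covariate file."""
    problems = []
    with gzip.open(path, "rt") as handle:
        header_line = handle.readline()
        if not header_line:
            return ["file is empty"]
        header = header_line.rstrip("\r\n").split("\t")
        if header[:2] != ["FID", "IID"]:
            problems.append("first two columns must be FID and IID")
        if " " in header_line:
            problems.append("line 1: contains a space")
        # e.g. the phenotypes in the phenolist and the covariates you use
        for name in required_columns:
            if name not in header:
                problems.append(f"missing column: {name}")
        for line_no, line in enumerate(handle, start=2):
            fields = line.rstrip("\r\n").split("\t")
            if " " in line:
                problems.append(f"line {line_no}: contains a space")
            if len(fields) != len(header):
                problems.append(
                    f"line {line_no}: {len(fields)} fields, header has {len(header)}")
            elif len(fields) >= 2 and fields[0] != fields[1]:
                problems.append(f"line {line_no}: FID and IID differ")
    return problems
```

An empty list means that none of these particular checks failed; it does not guarantee that the pipeline will accept the file.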

regenie.covariates: a list of covariate column names, separated by commas; for example: "age,gender". NOTE: The example .json file (regenie_example_R9.json) already defines the covariates used in the R9 core GWAS: age, sex, genotyping batch and PC1-10.
regenie.is_binary: true if your phenotype is binary (e.g. case-control), false if it is quantitative (e.g. BMI). Defines whether to run a logistic or linear model. For another example, see Running quantitative GWAS with REGENIE.

If you want to run a recessive or dominant model, you also need to edit:

regenie.sub_step2.step2.test: defines the association model type (additive, recessive or dominant) used in the GWAS. In the example model it is additive ("normal" GWAS). Unless you are specifically running a recessive or dominant model, there is no need to change this setting.
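The edits described above can also be scripted. The sketch below assumes that the example .json file is a flat JSON object keyed by the input names listed above; the function name and all file names are hypothetical. It leaves every other setting, including regenie.sub_step2.step2.test, untouched.

```python
import json


def customise_inputs(example_json, output_json, phenolist, cov_pheno,
                     covariates=None, is_binary=True):
    """Copy an example inputs file, replacing only the user-specific keys."""
    with open(example_json) as handle:
        inputs = json.load(handle)
    inputs["regenie.phenolist"] = phenolist
    inputs["regenie.cov_pheno"] = cov_pheno
    inputs["regenie.is_binary"] = is_binary  # True: logistic, False: linear
    if covariates is not None:
        # Covariates are given as one comma-separated string, e.g. "age,gender"
        inputs["regenie.covariates"] = ",".join(covariates)
    with open(output_json, "w") as handle:
        json.dump(inputs, handle, indent=4)
    return inputs
```

Pass the phenolist and cov_pheno values as bucket paths (for example under gs://fg-production-sandbox-<NO>-red/), and keep covariates=None to retain the covariates already defined in the example file.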

Logistic or linear?

In REGENIE, you choose between a logistic and a linear model by setting regenie.is_binary in the .json file: true runs a logistic model (for binary traits) and false runs a linear model (for continuous traits). If running a REGENIE model for a quantitative trait, you can also use this example.

Submit your REGENIE job

If you're running REGENIE using Sandbox Pipelines, it's a good idea to first read the sections Pipelines is based on Cromwell and WDL, How to use the Pipelines tool and How to submit a pipeline from the command line.

Using command line

Once your files are in order, you can submit your run by typing the following command in the FinnGen terminal:

finngen-cli rw -w /path/to/regenie.wdl \
                -i /path/to/your.json \
                -d /path/to/regenie_sub_wdl.zip

REMEMBER to save your job ID [WORKFLOW_ID] to keep track of your job and to be able to view the output! See also tips on how to find a pipeline job ID. You can monitor your job using the [WORKFLOW_ID] in the Pipelines tool.

Output

Once your job has finished successfully, you can find your output files in: /finngen/pipeline/cromwell/workflows/regenie/[WORKFLOW_ID]/call-sub_step2/shard-#/sub.regenie_step2/[SUBWORKFLOW_ID]/call-gather/shard-#/
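Because the output path contains several shard and subworkflow directories, it can be convenient to collect the result files with a glob pattern. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, it assumes the summary statistics sit directly in the call-gather shard directories named above, and it returns every file ending in .gz except [pheno].regenie.gz, following the note on this page that [pheno].gz is the preferred format.

```python
import glob
import os


def find_summary_stats(workflow_id,
                       root="/finngen/pipeline/cromwell/workflows/regenie"):
    """List .gz outputs of a finished workflow, skipping *.regenie.gz."""
    pattern = os.path.join(
        root, workflow_id, "call-sub_step2", "shard-*",
        "sub.regenie_step2", "*",  # "*" matches the [SUBWORKFLOW_ID]
        "call-gather", "shard-*", "*.gz")
    return sorted(path for path in glob.glob(pattern)
                  if not path.endswith(".regenie.gz"))
```

If the list is empty, check the job status and the job ID before assuming that the run produced no results.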

Running REGENIE (R7) in the Sandbox was presented in the User Meeting on 24 August 2021.

Some things to consider:

Make sure that all files you have edited yourself, including your edited phenotype-covariate file, are in /finngen/red/. Note that to copy files to /finngen/red/, you need to use gsutil and the gs://fg-production-sandbox-<NO>-red/ bucket path that corresponds to /finngen/red.
Bucket paths in the .json file need to follow the form proposed in buckets.txt when specifying the inputs (e.g. for the modified .json file).
Make sure that you are using the latest version of REGENIE in regenie.sub_step1.step1.docker and regenie.sub_step2.docker.
[pheno].gz is the preferred summary statistics format (not [pheno].regenie.gz) for downstream analyses, e.g. finemapping.
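The relationship between /finngen/red/ and its bucket can be sketched as a small Python helper. This is a hedged illustration only: the function names are hypothetical, the Sandbox number (the <NO> part of the bucket name) must be supplied by you, and the resulting path should still be checked against the form given in buckets.txt. The copy step uses the standard gsutil cp command.

```python
def red_bucket_uri(local_path, sandbox_no):
    """Translate a /finngen/red/ path into its gs:// bucket path."""
    prefix = "/finngen/red/"
    if not local_path.startswith(prefix):
        raise ValueError(f"not under {prefix}: {local_path}")
    bucket = f"gs://fg-production-sandbox-{sandbox_no}-red/"
    return bucket + local_path[len(prefix):]


def gsutil_copy_command(source_file, red_target, sandbox_no):
    """Build (but do not run) the gsutil command that uploads a file."""
    return ["gsutil", "cp", source_file, red_bucket_uri(red_target, sandbox_no)]
```

The returned list can be passed to subprocess.run, or printed and run by hand in the terminal.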

See how the Sandbox paths and pipelines are mapped here.

Related:

If your pipeline job fails
