How to run an interaction GWAS using the Regenie unmodifiable pipeline

This page describes how to run an interaction GWAS analysis using the FinnGen unmodifiable interaction Regenie pipeline.

Introduction

Unmodifiable pipelines are predefined workflows that cannot be modified by the user. The advantage of unmodifiable pipelines is that your results will be transferred directly to the green library (no download requests needed) because the results of unmodifiable pipelines have been verified not to contain any individual-level data.

The purpose of this interaction GWAS pipeline is to test gene-by-environment (GxE) and gene-by-gene (GxG) interactions between your phenotype(s) of interest and genetic variants. If you are not interested in the effects of interaction terms and instead want to perform a standard GWAS, please see the page about using the unmodifiable Regenie GWAS pipeline. Unlike the standard unmodifiable Regenie GWAS pipeline, interaction GWAS results are not added to the User results PheWeb browser, though this may change in the future.

This pipeline works in an identical way to the aforementioned unmodifiable Regenie GWAS pipeline, in that you provide a phenotype file, phenotype names and phenotype descriptions. In addition, you must provide the name of the interaction term (covariate name or variant ID) and, optionally, the locations of either the interaction covariate (text) file (GxE) or interaction variant bgen+sample files (GxG). These optional files do not need to be provided if 1) the variant (for GxG) is in the imputed data of the release you are using or 2) the interaction covariate (for GxE) exists in the FinnGen standard covariate file or is provided in the phenotype file. Note that the interaction term is never included in the model-building step (step 1) of the analysis.

For more information about interaction testing in Regenie, please see the relevant Regenie documentation page.

Preparing the pipeline inputs

These steps are similar to the data preparation needed for the standard unmodifiable Regenie GWAS pipeline, with additional steps for interaction covariate preparation (if needed).

First, prepare your phenotype file. The phenotype file should contain sample ID columns named FID and IID, and one column for each phenotype included. The FID and IID columns should both contain the FINNGENID, with one line per individual. You may also include additional custom analysis covariates and/or your interaction covariate (GxE) in this file. These covariates cannot have the same names as the covariates available in the analysis covariate file.
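The layout described above can be sketched in Python as follows. This is a minimal illustration, not part of the pipeline: the tab delimiter, the function name write_phenotype_file, the sample IDs, and the column names endpoint1, endpoint2 and my_covariate are all assumptions made for the example.

```python
import csv
import io


def write_phenotype_file(handle, rows, phenotypes, extra_columns=()):
    """Write one line per individual, with FID and IID both set to the FINNGENID."""
    header = ["FID", "IID", *phenotypes, *extra_columns]
    # Assumption: tab-delimited output; the delimiter is not stated on this page.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    seen = set()
    for finngen_id, values in rows:
        if finngen_id in seen:
            raise ValueError(f"duplicate individual: {finngen_id}")
        seen.add(finngen_id)
        writer.writerow([finngen_id, finngen_id, *(values[c] for c in header[2:])])


# Hypothetical example data: the IDs and values below are made up.
example_rows = [
    ("FG00000001", {"endpoint1": 1, "endpoint2": 0, "my_covariate": 2.5}),
    ("FG00000002", {"endpoint1": 0, "endpoint2": 0, "my_covariate": 1.0}),
]
buffer = io.StringIO()
write_phenotype_file(buffer, example_rows, ["endpoint1", "endpoint2"], ["my_covariate"])
phenotype_text = buffer.getvalue()
```

To write a real file, pass an open file handle instead of the in-memory buffer.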

Second, create a file with phenotype description(s), one per line. The phenotype description is a text description of the phenotypes to be analysed (either all or a subset of the phenotypes in your phenotype file, described above). This single-column text file should have one line per analysed phenotype with a meaningful description of the phenotype. The phenotype descriptions should be in the same order as your list of phenotypes in the workflow's input json (see below). For security reasons, the descriptions can only include alphanumeric characters (a-zA-Z0-9), spaces and some punctuation characters (.,:;-_) and cannot contain tabs.
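Before copying the description file, it can help to check it against the character rules above. The sketch below is one way to do that in Python; the function names are hypothetical and the check is only a local convenience, not the validation the pipeline itself performs.

```python
import re

# Allowed characters as listed above: a-zA-Z0-9, spaces and . , : ; - _
ALLOWED_DESCRIPTION = re.compile(r"[a-zA-Z0-9 .,:;_-]+")


def invalid_descriptions(descriptions):
    """Return the descriptions that contain a disallowed character (tabs included)."""
    return [d for d in descriptions if not ALLOWED_DESCRIPTION.fullmatch(d)]


def pair_with_phenotypes(descriptions, phenolist):
    """Pair each description with its phenotype, in the order given in phenolist."""
    phenotypes = phenolist.split(",")
    if len(descriptions) != len(phenotypes):
        raise ValueError("need exactly one description per analysed phenotype")
    bad = invalid_descriptions(descriptions)
    if bad:
        raise ValueError(f"descriptions with disallowed characters: {bad}")
    return list(zip(phenotypes, descriptions))
```

For example, pair_with_phenotypes(["Type 2 diabetes, strict definition"], "endpoint1") pairs the description with endpoint1, while a description containing parentheses or a tab raises a ValueError.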

Third, if you're not using a core FinnGen covariate (GxE) or an R12 imputed variant (GxG) as your interaction variable, then you will need to:

GxE: either a) create a separate (gzipped) tab-delimited text file with columns FID and IID and third column containing your interaction covariate or b) include your interaction covariate as a column in your phenotype file, as mentioned above. For security reasons, the interaction covariate name (and thus its column name) can only contain alphanumeric characters (a-zA-Z0-9), hyphens (-) and underscores (_).
GxG: create a bgen and sample file pair containing your interaction variant. The bgen format needs to be at least v1.2 and the variant name in the "rsid" data block should correspond to the name of the interaction term you provide in the workflow's input json (see below).

Skip this step otherwise.
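For the GxE option a) above, the separate gzipped covariate file could be produced along these lines. This is a sketch under assumptions: the function name, the covariate name my_exposure, the sample IDs and the temporary output location are all made up for illustration.

```python
import gzip
import os
import re
import tempfile

# Allowed characters for the interaction covariate name: a-zA-Z0-9, hyphen, underscore.
COVARIATE_NAME = re.compile(r"[a-zA-Z0-9_-]+")


def write_interaction_covariate_file(path, covariate_name, values_by_id):
    """Write a gzipped, tab-delimited file with columns FID, IID and the covariate."""
    if not COVARIATE_NAME.fullmatch(covariate_name):
        raise ValueError(f"disallowed characters in covariate name: {covariate_name!r}")
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as handle:
        handle.write(f"FID\tIID\t{covariate_name}\n")
        for finngen_id, value in values_by_id.items():
            handle.write(f"{finngen_id}\t{finngen_id}\t{value}\n")


# Hypothetical covariate name and made-up IDs, written to a temporary folder.
example_path = os.path.join(tempfile.mkdtemp(), "interaction_covariate.txt.gz")
write_interaction_covariate_file(example_path, "my_exposure", {"FG00000001": 1, "FG00000002": 0})
with gzip.open(example_path, "rt", encoding="utf-8") as handle:
    written_lines = handle.read().splitlines()
```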

Fourth, copy your phenotype file, phenotype descriptions file and interaction covariate (gzipped) text or bgen+sample files (if applicable) to a bucket that is accessible in google cloud (locations in /home/ivm/ are not accessible in the cloud). We recommend copying these files to your sandbox's "IVM bucket", which is mapped to the internal location /finngen/red/ but is actually a google cloud bucket gs://fg-production-sandbox-X-red/ where X is your organisation's sandbox number. You can find this bucket location in the file buckets.txt on your sandbox desktop by looking for the line starting "Sandbox ivm bucket". Alternatively, you can run the command echo $RED_BUCKET in your sandbox's terminal and it will print the red bucket path.

To copy the files, open a terminal and navigate to the folder where the files are located. Then run the command gsutil cp phenotypefile.txt phenotypedescriptionfile.txt otherfiles gs://fg-production-sandbox-X-red/myuser/myfolder/ where

phenotypefile.txt and phenotypedescriptionfile.txt are the names of your phenotype and phenotype description files, which can also be gzipped
otherfiles are the names of files that you created in the previous step (gzipped text or bgen+sample) containing the interaction covariate - leave this out if not needed
myuser is typically your username, but can also be another name of your choosing
myfolder is the name of the subfolder that you want to move the files to. If it doesn't already exist, the gsutil command will create it when copying
X is the number of your sandbox

Once copied, make a note of the full bucket path of these files. You can get the full path by running the command gsutil ls gs://fg-production-sandbox-X-red/myuser/myfolder/, which will list the full bucket paths of all files that you copied. The internal pipeline tools also understand shorter variable-name paths; for the red bucket, the variable is SANDBOX_RED. You can use this instead of gs://fg-production-sandbox-X-red in your file paths when editing the pipeline's json file (explained in the next step).
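The substitution between the full bucket path and the SANDBOX_RED form is purely textual, as the following sketch shows. The helper name to_short_path is hypothetical, and the sandbox number 7 in the usage note is only an example.

```python
import re

# Matches the red bucket prefix for any sandbox number X.
BUCKET_PREFIX = re.compile(r"^gs://fg-production-sandbox-\d+-red(?=/|$)")


def to_short_path(bucket_path):
    """Replace the red bucket prefix with SANDBOX_RED; other paths are returned unchanged."""
    return BUCKET_PREFIX.sub("SANDBOX_RED", bucket_path, count=1)
```

For example, to_short_path("gs://fg-production-sandbox-7-red/myuser/myfolder/phenotypefile.txt") gives "SANDBOX_RED/myuser/myfolder/phenotypefile.txt".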

Preparing and submitting the workflow in Pipelines

The standard way to submit this workflow is through the sandbox Pipelines tool. To submit your unmodifiable interaction Regenie analysis, open the Pipelines tool from the Applications menu in your sandbox VM and select "Unmodifiable workflow" from the "Create a new job" menu. Find "UnmodifiableRegenieInteractionDF12" and select "Create", then scroll down to the "Input JSON" and edit the options (explained below). See Fig. 1 for a visual demonstration of the process.

Fill in the following required json file options for the workflow (ensuring that the double quotes are kept):

regenie_interact_unmod.pheno_file: This field should have the full bucket path of your phenotype file, created in the first step and copied in the fourth. E.g. gs://fg-production-sandbox-X-red/myuser/myfolder/phenotypefile.txt or SANDBOX_RED/myuser/myfolder/phenotypefile.txt
regenie_interact_unmod.phenolist: A comma-separated list of your phenotypes - these are the same as your phenotype column names in your phenotype file. Example value: "endpoint1,endpoint2"
regenie_interact_unmod.phenodescriptionlist: This should be the full bucket path of the phenotype description file you created in the second step and copied in the fourth step. E.g. gs://fg-production-sandbox-X-red/myuser/myfolder/phenotypedescriptionfile.txt or SANDBOX_RED/myuser/myfolder/phenotypedescriptionfile.txt
regenie_interact_unmod.interact_type: Choices are "GxE" (default) and "GxG". Any other string will fail the workflow.
regenie_interact_unmod.interact_term: Here you should name the interaction variable (GxE) or variant (GxG).
- If interact_type is "GxE", this should either be the name of a column in your phenotype or optional interaction covariate file or the name of an existing FinnGen covariate. If your interaction variable is categorical, the base category should be provided in square brackets after the name of the variable (e.g. SEX_IMPUTED[0]).
- If interact_type is "GxG", this should be the variant ID in the format chrCHR_POS_REF_ALT where CHR is in the range 1-22 or X, POS is the build 38 variant position and REF and ALT are the reference and alternate alleles. The default GxG interaction test is additive, but dominant, recessive and categorical tests can instead be chosen. You can specify the test in square brackets after the variant ID, e.g. chrCHR_POS_REF_ALT[dom/rec/cat].
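The two interact_term formats described above can be checked locally before submitting. The sketch below is an approximation under assumptions: it treats [dom/rec/cat] as a choice of exactly one of dom, rec or cat, restricts alleles to the letters A, C, G and T, and applies the covariate-name character rules to every GxE term. The function name and the variant chr1_12345_A_G are made up for the example.

```python
import re

# GxG: chrCHR_POS_REF_ALT with an optional test in square brackets.
GXG_TERM = re.compile(
    r"chr(?P<chrom>[1-9]|1[0-9]|2[0-2]|X)_(?P<pos>[1-9][0-9]*)"
    r"_(?P<ref>[ACGT]+)_(?P<alt>[ACGT]+)"  # assumption: alleles are A/C/G/T strings
    r"(?:\[(?P<test>dom|rec|cat)\])?"
)
# GxE: covariate name with an optional base category in square brackets.
GXE_TERM = re.compile(r"(?P<name>[a-zA-Z0-9_-]+)(?:\[(?P<base>[^\[\]]+)\])?")


def parse_interact_term(term, interact_type="GxE"):
    """Split an interaction term into its parts, or raise ValueError if malformed."""
    pattern = {"GxE": GXE_TERM, "GxG": GXG_TERM}[interact_type]
    match = pattern.fullmatch(term)
    if match is None:
        raise ValueError(f"{term!r} is not a valid {interact_type} interaction term")
    return match.groupdict()
```

Here parse_interact_term("SEX_IMPUTED[0]") returns the name SEX_IMPUTED with base category "0", and parse_interact_term("chr1_12345_A_G[dom]", "GxG") returns the variant parts with the test "dom". When no bracket is given, the bracketed field is None.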

The remaining json options are conditional on the type of analysis you are running:

regenie_interact_unmod.covariates: These are the covariates included in the analysis, as a comma-separated list. By default, this field contains the covariates used in core analysis. The interaction covariate should not be included in this list; the pipeline will fail if it is.
regenie_interact_unmod.test: This should be either "additive", "recessive" or "dominant", depending on the type of base Regenie GWAS you want. Note that this is not related to the test for the interaction variant (GxG) - that can be specified when providing the interaction variant ID.
regenie_unmod.is_binary: This input should be set to "true" if your endpoint is a binary case-control endpoint, and "false" if it is quantitative.
regenie_interact_unmod.interact_covfile: This should be the bucket location of your interaction covariate (gzipped) text file. Leave this option as the default value (empty_file) if either
- your interact_type is "GxG" or
- your interact_type is "GxE" and your interaction covariate is provided elsewhere (i.e. in the phenotype file or the default FinnGen covariate file)
regenie_interact_unmod.interact_bgen: This is the bucket location of your interaction variable bgen file. Leave this option as the default value (empty_file) if either
- your interact_type is "GxE" or
- your interact_type is "GxG" and you are naming a variant from the imputed R12 genotypes
regenie_interact_unmod.interact_sample: Similarly to above, this is the bucket location of the sample file that corresponds to your bgen file. You must provide this sample file location if you have given a bgen file location for the interact_bgen variable. Leave this option as the default value (empty_file) if you did not specify a bgen file in the interact_bgen option.
regenie_interact_unmod.interact_extra_flags: This option allows you to specify additional interaction options in Regenie. Accepted flags are --interaction-file-reffirst, --no-condtl, --force-condtl and --rare-mac X (where X is an integer). The default is only --interaction-file-reffirst, which assumes the user-provided bgen file has the first allele as reference (true for FinnGen R12 imputed data). If you know this not to be the case, remove this flag and leave this variable as an empty string (i.e. "").
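Several of the rules above depend on each other, so a quick consistency check on the edited options can catch mistakes before submission. The following is a sketch only: check_inputs and is_unset are hypothetical helpers, the covariate list in the example is made up, and the assumption that an unset option is a string ending in "empty_file" should be compared against the defaults shown in your own Input JSON.

```python
PREFIX = "regenie_interact_unmod."


def is_unset(value):
    # Assumption: the default placeholder is a string ending in "empty_file".
    return value is None or value == "" or str(value).endswith("empty_file")


def check_inputs(inputs):
    """Return a list of problems found in the interaction-related options."""
    def option(name, default=""):
        return inputs.get(PREFIX + name, default)

    problems = []
    interact_type = option("interact_type", "GxE")
    if interact_type not in ("GxE", "GxG"):
        problems.append('interact_type must be "GxE" or "GxG"')

    # Strip any bracketed base category or test before comparing names.
    term_name = option("interact_term").split("[", 1)[0]
    if not term_name:
        problems.append("interact_term is required")
    elif term_name in option("covariates").split(","):
        problems.append("interact_term must not also appear in covariates")

    test = option("test", None)
    if test is not None and test not in ("additive", "recessive", "dominant"):
        problems.append('test must be "additive", "recessive" or "dominant"')

    has_bgen = not is_unset(option("interact_bgen", None))
    has_sample = not is_unset(option("interact_sample", None))
    has_covfile = not is_unset(option("interact_covfile", None))
    if has_bgen != has_sample:
        problems.append("interact_bgen and interact_sample must be given together")
    if interact_type == "GxG" and has_covfile:
        problems.append("interact_covfile should stay at its default for GxG")
    if interact_type == "GxE" and has_bgen:
        problems.append("interact_bgen should stay at its default for GxE")
    return problems


# Hypothetical options: the interaction term is wrongly repeated in covariates.
example_inputs = {
    PREFIX + "interact_type": "GxE",
    PREFIX + "interact_term": "SEX_IMPUTED[0]",
    PREFIX + "covariates": "SEX_IMPUTED,my_covariate",
    PREFIX + "test": "additive",
    PREFIX + "interact_covfile": "empty_file",
    PREFIX + "interact_bgen": "empty_file",
    PREFIX + "interact_sample": "empty_file",
}
example_problems = check_inputs(example_inputs)
```

In this example, example_problems holds a single message about the interaction term appearing in covariates; removing SEX_IMPUTED from the covariates list yields an empty list.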

Once the json options have been edited, click "Submit" and navigate to the submitted job list by clicking on "Show pipeline jobs" under the "Submitted job" header on the front page of the Pipelines tool. Your Regenie interaction GWAS job should appear at the top of the list with the Name as "regenie_interact_unmod" (hopefully in the "Running" state). Make a note of the (job) ID, as you will need it to download your results.

Submitting the workflow on the command line (no green library upload)

If you wish to edit the pipeline functionality or avoid the results being uploaded to the green library (available for all users), then you can submit this workflow from the command line in the terminal. To do this, follow the steps above, but instead of finding "UnmodifiableRegenieInteractionDF12" and selecting "Create", click "Download". Then:

save the zip file and unzip it to your desired location, either using file manager or the terminal emulator
use a text editor to edit the options in the unzipped inputs.json as described above
in the terminal, navigate to the location of the unzipped (and edited) workflow files
run the command finngen-cli request-workflow -w UnmodifiableRegenieInteractionDF12-1.0.wdl -i inputs.json -d subwdls.zip
check the Pipelines tool to find the workflow ID and see the progress of your job
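As an alternative to a text editor, the unzipped inputs.json can be edited with a short script. This is a sketch with hypothetical helper names; it assumes inputs.json is a flat JSON object keyed by option name, as the option list above suggests, and it refuses to add keys that are not already in the file so that typos in option names are caught.

```python
import json


def apply_changes(inputs, changes):
    """Return a copy of inputs with changes applied; reject unknown option names."""
    unknown = sorted(set(changes) - set(inputs))
    if unknown:
        raise KeyError(f"options not present in the inputs: {unknown}")
    return {**inputs, **changes}


def update_inputs_file(path, changes):
    """Load a JSON inputs file, apply changes and write it back in place."""
    with open(path, encoding="utf-8") as handle:
        inputs = json.load(handle)
    updated = apply_changes(inputs, changes)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(updated, handle, indent=2)
        handle.write("\n")
    return updated
```

For example, update_inputs_file("inputs.json", {"regenie_interact_unmod.interact_type": "GxG"}) changes only that option and leaves the rest of the file's options untouched.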

Accessing the GWAS results (submitted through Pipelines)

On completion (job state "Succeeded"), the results will be automatically copied to the green library bucket specific for each data release: /finngen/library-green/finngen_R[RELEASE]/unmodifiable_pipelines/UnmodifiableRegenieInteractionDF[RELEASE]/[WORKFLOWID] where [RELEASE] is the data freeze number (e.g. 12). E.g. for R12, to /finngen/library-green/finngen_R12/unmodifiable_pipelines/UnmodifiableRegenieInteractionDF12/[WORKFLOWID]. The [WORKFLOWID] will be the pipelines app ID for your job.
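The results folder follows directly from the release number and the job ID, as in this small sketch. The function name is hypothetical and the job ID in the usage note is a made-up placeholder.

```python
def green_library_results_dir(release, workflow_id):
    """Build the green library results folder for a release and Pipelines job ID."""
    return (
        f"/finngen/library-green/finngen_R{release}/unmodifiable_pipelines/"
        f"UnmodifiableRegenieInteractionDF{release}/{workflow_id}"
    )
```

For example, green_library_results_dir(12, "my-job-id") gives /finngen/library-green/finngen_R12/unmodifiable_pipelines/UnmodifiableRegenieInteractionDF12/my-job-id.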

Alternatively, the results uploaded to the green library can also be accessed outside the sandbox by navigating (using a web browser) to https://console.cloud.google.com/storage/browser/finngen-production-library-green/finngen_RX/unmodifiable_pipelines/UnmodifiableRegenieDFX where X is the FinnGen data freeze you ran the GWAS for (e.g. 12). You will need to log in with the same Google account with which you access the FinnGen sandbox. On this page, look for the folder corresponding to your GWAS job ID and open it to access the GWAS output files.

Accessing the GWAS results (submitted from the command line using finngen-cli)

When submitted from the command line, analysis results will not be copied to the green library. Instead, you will need to navigate to the pipelines browser, find your job in the list of your submitted pipeline jobs and open the job's information by clicking on it. In the "info" tab, you should then see the file paths of all output files for the workflow. You can then copy these to your desired location, using either the File Manager or the terminal.
