Unmodifiable Finemapping pipeline

How to run finemapping with SuSiE and FINEMAP with the unmodifiable pipeline

Unmodifiable pipelines are predefined workflows that cannot be modified by the user. Their advantage over modifiable pipelines is that results are delivered directly to the green library and the User results PheWeb browser. No download requests are needed, because the results of unmodifiable pipelines have been verified not to contain any individual-level data. Running the unmodifiable finemapping pipeline is very similar to running finemapping in the modifiable pipelines, with some small restrictions. The unmodifiable finemapping pipeline can be accessed in the sandbox from the Pipelines app -> unmodifiable workflow -> Unmodifiable Finemap DF12 (or DF13). For more information about the Pipelines tool, check the Pipelines tool documentation.

Finemapping

As GWAS summary statistics do not provide information about which variants are the causal variants in any given region, this pipeline can be used to help you identify both the likely causal variant(s) and credible sets that are 95% and 99% likely to contain the causal variant. For more information about finemapping in general, see Finemapping and for information about finemapping results files, see Finemapping results format.

Inputs to change

Once you have found Unmodifiable Finemap DF12/DF13 in the Pipelines tool, scroll down to the input JSON section to add your inputs. The workflow has three required inputs in its input JSON; the rest are optional.

The required inputs are:

finemap.sumstats_pattern: A pattern for the location of your summary statistics. If your endpoint results files were located in SANDBOX_RED/user/ as endpoint_A.gz, endpoint_B.gz, etc., the pattern would be SANDBOX_RED/user/{PHENO}.gz. The summary statistics must be in the standard FinnGen summary statistics format (described below). If you have used the unmodifiable Regenie pipeline to perform your initial GWAS, either in the Pipelines tool or in CohortOperations, your results will automatically be uploaded to the green library. In this case, the pattern would be gs://finngen-production-library-green/finngen_RX/sandbox_custom_gwas/{PHENO}/{PHENO}.gz or (using the Pipeline mappings form) LIBRARY_GREEN/finngen_RX/sandbox_custom_gwas/{PHENO}/{PHENO}.gz, where X is the relevant FinnGen release for your GWAS run (e.g. 12). Note: The curly braces {} must not be removed, as the workflow uses them to identify where to substitute each endpoint name.
finemap.phenolistfile: A plaintext file containing only the endpoint names, one per line. These endpoint names should match the phenotype names from your GWAS runs; the pipeline assumes that each endpoint's GWAS summary statistics file can be found by replacing {PHENO} in the "finemap.sumstats_pattern" you provided above with each endpoint's name. This file will need to be copied to your organisation's red bucket (see Copying files to your organisation's red bucket).
finemap.phenotypes: A tab-delimited phenotype file containing the endpoints to analyze, with columns "FID" and "IID" (both FINNGENID) followed by one column per phenotype, named after the phenotype. The file should be the same phenotype file used for the original GWAS scan and should, ideally, be located in your organisation's red bucket. If you created your phenotype file and ran the GWAS using Cohort Operations, you should set this variable to CUSTOM_GWAS/[workflow_ID]/[PHENO].tsv, where [workflow_ID] is the GWAS job ID from the Pipelines tool and [PHENO] is the name of your phenotype.
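
As an illustration of how the three required inputs fit together, the following Python sketch expands a sumstats pattern for a list of endpoints and assembles a minimal input JSON. The phenolist and phenotype file paths are hypothetical placeholders, and the {PHENO} substitution shown here is only an approximation of what the workflow does internally.

```python
import json

# Pattern taken from the example above; {PHENO} is the placeholder
# that gets replaced by each endpoint name.
SUMSTATS_PATTERN = "SANDBOX_RED/user/{PHENO}.gz"


def expand_pattern(pattern, endpoints):
    """Return the sumstats path implied by the pattern for each endpoint."""
    if "{PHENO}" not in pattern:
        raise ValueError("pattern must contain the {PHENO} placeholder")
    return [pattern.replace("{PHENO}", name) for name in endpoints]


# Endpoint names as they would appear, one per line, in the phenolist file.
endpoints = ["endpoint_A", "endpoint_B"]
sumstats_paths = expand_pattern(SUMSTATS_PATTERN, endpoints)

# Minimal input JSON with only the three required inputs.
# The two file paths below are hypothetical; use your own locations.
inputs = {
    "finemap.sumstats_pattern": SUMSTATS_PATTERN,
    "finemap.phenolistfile": "SANDBOX_RED/user/phenolist.txt",
    "finemap.phenotypes": "SANDBOX_RED/user/phenotypes.tsv",
}
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Checking the expanded paths against the files you actually have is a quick way to catch a mismatch between endpoint names and file names before submitting the job.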

In case you want to select the regions yourself instead of using automatic region selection, the following input will also have to be filled:

finemap.bed_regions_file: A plaintext file containing the bucket paths to the region definitions for each of the endpoints. This file format is described below.

The remaining inputs are parameters to the analysis. In most cases, there is no need to adjust these parameters. Most of the below parameters control automatic region selection.

finemap.preprocess.scale_se_by_pval: Whether to scale standard error by p-value. This will affect the finemapping results.
finemap.preprocess.x_chromosome: Whether to include the X chromosome. True by default.
finemap.preprocess.window: The default finemapping window that is extended around genome-wide significant variants. The window is extended in both directions, meaning the default area that is finemapped around a significant variant is 3 Mb (1_500_000 base pairs in each direction). Overlapping regions will be merged. Please note that larger values will increase the computational resource usage and can result in regions that are too large to finemap.
finemap.preprocess.max_region_width: Maximum region size in automatic region selection. If the region selection produces larger regions by merging multiple regions, the region selection is tried again with a smaller window extended around each significant variant. Please note that larger values might result in regions that are too large to finemap.
finemap.preprocess.window_shrink_ratio: The factor by which the window size is shrunk when a region is too large. If automatic region selection produces a region larger than finemap.preprocess.max_region_width, new windows are extended from each significant variant in that region, with each window scaled by the window shrink ratio.
finemap.set_variant_id_map_chr: Map chrX (or other chromosomes) to non-numeric chromosomes for the benefit of the pipeline.
finemap.preprocess.p_threshold: Threshold for genome-wide significance. Making this larger will increase the number of regions to finemap.
finemap.ldstore_finemap.n_causal_snps: Maximum number of causal variants in a region. Finemapping will be able to identify N or fewer separate causal variants in a region. Note that increasing this will increase the resource usage of the finemapping algorithms.
finemap.ldstore_finemap.susie.min_cs_corr: A "purity" threshold for the credible sets. Any credible set that contains a pair of variables with correlation less than this threshold will be filtered out and not reported.
finemap.ldstore_finemap.filter_and_summarize.good_cred_r2: A "purity" threshold for the credible sets. Any credible set with minimum r2 correlation between the variants under this threshold will be considered a low-quality credible set.
finemap.ldstore_finemap.ldstore.enable_fuse: Enable GCS fuse. If fuse is supported by Cromwell, it reduces the amount of data that needs to be localized, which shortens the time spent in finemapping tasks. The current configuration of Cromwell and the backend (BATCH) does not support GCS fuse.
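
To make the interplay of window, max_region_width and window_shrink_ratio more concrete, here is a simplified Python sketch of the behaviour described above. It is not the pipeline's actual implementation: it handles a single chromosome, the function names are invented for illustration, and the max_region_width and window_shrink_ratio values in the usage lines are hypothetical rather than the pipeline defaults.

```python
def merge_windows(positions, window):
    """Extend `window` bp on both sides of each variant and merge overlaps.

    Returns a list of (start, end, member_positions) tuples.
    """
    regions = []
    for pos in sorted(positions):
        start, end = max(0, pos - window), pos + window
        if regions and start <= regions[-1][1]:
            prev_start, prev_end, members = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), members + [pos])
        else:
            regions.append((start, end, [pos]))
    return regions


def select_regions(positions, window, max_region_width, window_shrink_ratio):
    """Retry oversized regions with a window scaled by the shrink ratio."""
    if not 0 < window_shrink_ratio < 1:
        raise ValueError("window_shrink_ratio must be between 0 and 1")
    finished = []
    pending = [(list(positions), window)]
    while pending:
        variants, current_window = pending.pop()
        for start, end, members in merge_windows(variants, current_window):
            if end - start > max_region_width:
                # Too wide: redo this region only, with a smaller window.
                smaller = int(current_window * window_shrink_ratio)
                pending.append((members, smaller))
            else:
                finished.append((start, end))
    return sorted(finished)


# Two nearby hits merge into one region; the distant hit stays separate.
merged = select_regions([1_000_000, 1_200_000, 10_000_000],
                        window=1_500_000,
                        max_region_width=6_000_000,   # hypothetical value
                        window_shrink_ratio=0.5)      # hypothetical value

# Here the merged region (5 Mb) exceeds the limit, so the window is halved.
shrunk = select_regions([5_000_000, 7_000_000],
                        window=1_500_000,
                        max_region_width=4_000_000,   # hypothetical value
                        window_shrink_ratio=0.5)      # hypothetical value
```

In the second call the two variants end up in separate, narrower regions, which is the trade-off the shrink ratio controls: smaller regions are cheaper to finemap but cover less of the surrounding LD.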

Input summary statistics format

For finemapping, your summary statistics files need to be in the standard FinnGen GWAS results format, which is the standard output format for the FinnGen REGENIE and SAIGE GWAS pipelines. Your summary statistics files must contain the following columns:

chromosome column: "#chrom"
position column: "pos"
reference allele column: "ref"
alternate allele column: "alt"
alternate allele frequency column: "af_alt"
effect size column: "beta"
std error of effect column: "sebeta"
p-value column: "pval"
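
Before submitting a job, it can be useful to confirm that a summary statistics file has all of the required columns. The sketch below checks the header line of a gzipped file; it assumes the file is tab-delimited with the header on the first line, and the function names are illustrative only.

```python
import gzip

# Column names required by the finemapping pipeline, as listed above.
REQUIRED_COLUMNS = ["#chrom", "pos", "ref", "alt",
                    "af_alt", "beta", "sebeta", "pval"]


def missing_columns(header_line):
    """Return the required columns that are absent from a header line."""
    columns = header_line.rstrip("\r\n").split("\t")
    return [name for name in REQUIRED_COLUMNS if name not in columns]


def check_sumstats(path):
    """Read the first line of a gzipped sumstats file and check its header."""
    with gzip.open(path, "rt") as handle:
        return missing_columns(handle.readline())


# Example with an in-memory header that lacks three required columns.
example_missing = missing_columns("#chrom\tpos\tref\talt\tpval\n")
print(example_missing)
```

An empty list means every required column is present; extra columns in the file are ignored by this check.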

Custom region file

The custom region file is a file that should contain as many lines as there are endpoints in your finemapping job. Each of those lines should contain a google storage file path to the region definition file corresponding to the endpoint on the same line number in phenolistfile. It is very important that the order of endpoints in the phenolistfile is exactly the same as the order of the bed regions file.

For example, if you had two endpoints to finemap, and you wanted to select the regions to finemap, your "finemap.phenolistfile" might look like this:

ENDPOINT_A
ENDPOINT_B

In that case, your "finemap.bed_regions_file" should look something like this:

gs://your-bucket/path-to-ENDPOINT_A-regions.txt
gs://your-bucket/path-to-ENDPOINT_B-regions.txt
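
Because the two files are matched purely by line number, a quick consistency check can catch ordering mistakes. The Python sketch below pairs the lines of the two files and flags pairs where the endpoint name does not appear in the region file path. That name check is only a heuristic and assumes you include the endpoint name in each region file name, as in the example above; the function names are hypothetical.

```python
def pair_endpoints_with_regions(phenolist_text, bed_regions_text):
    """Pair endpoint names with region file paths by line number."""
    endpoints = [line.strip() for line in phenolist_text.splitlines() if line.strip()]
    region_paths = [line.strip() for line in bed_regions_text.splitlines() if line.strip()]
    if len(endpoints) != len(region_paths):
        raise ValueError(
            f"{len(endpoints)} endpoints but {len(region_paths)} region files"
        )
    return list(zip(endpoints, region_paths))


def suspicious_pairs(pairs):
    """Heuristic: flag pairs whose path does not mention the endpoint name."""
    return [(endpoint, path) for endpoint, path in pairs if endpoint not in path]


phenolist = "ENDPOINT_A\nENDPOINT_B\n"
bed_regions = (
    "gs://your-bucket/path-to-ENDPOINT_A-regions.txt\n"
    "gs://your-bucket/path-to-ENDPOINT_B-regions.txt\n"
)
pairs = pair_endpoints_with_regions(phenolist, bed_regions)
print(suspicious_pairs(pairs))
```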

Region file format

Each of those region files should be a bed file, that is, a file with one genomic region per line. Each line consists of the (numeric) chromosome, the region start in base pairs, and the region end in base pairs, separated by spaces. For example, if you wanted to finemap base pairs 1,000,000-4,000,000 in chromosome 1 and base pairs 55,000,000-57,000,000 in chromosome X, the bed region file for that endpoint would look like the following:

1 1000000 4000000
23 55000000 57000000
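
If you generate region files programmatically, a small helper along these lines can produce lines in this format. This is a sketch with a hypothetical function name; the X-to-23 mapping follows the example above, and other non-numeric chromosomes are deliberately not handled.

```python
def to_bed_line(chrom, start, end):
    """Format one region as 'chromosome start end' with a numeric chromosome."""
    label = str(chrom).upper().removeprefix("CHR")
    # Chromosome X is written as 23, as in the example above.
    chrom_number = 23 if label == "X" else int(label)
    if start >= end:
        raise ValueError("region start must be smaller than region end")
    return f"{chrom_number} {start} {end}"


regions = [(1, 1_000_000, 4_000_000), ("X", 55_000_000, 57_000_000)]
bed_text = "\n".join(to_bed_line(*region) for region in regions) + "\n"
print(bed_text, end="")
```

The resulting text can be written to a file and copied to your bucket, with its path then listed in the "finemap.bed_regions_file".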

Outputs

Output file locations

The results will be automatically copied to the green library bucket specific for each data release: /finngen/library-green/finngen_R[RELEASE]/unmodifiable_pipelines/UnmodifiableFinemapDF[RELEASE]/workflow_id

E.g. for R12, to /finngen/library-green/finngen_R12/unmodifiable_pipelines/UnmodifiableFinemapDF12/workflow_id .

The workflow_id will be the Pipelines app ID for your job. Because this is an unmodifiable pipeline, the results in the green library are also accessible outside the sandbox, see Accessing green data.
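
As a small convenience, the output location can be assembled from the release number and the job ID. The sketch below simply fills in the path template given above; the function name and the workflow ID in the usage line are hypothetical.

```python
def finemap_output_dir(release, workflow_id):
    """Build the green library output directory for a finemapping job."""
    return (
        f"/finngen/library-green/finngen_R{release}"
        f"/unmodifiable_pipelines/UnmodifiableFinemapDF{release}/{workflow_id}"
    )


# "my-workflow-id" is a placeholder for the Pipelines app ID of your job.
example_dir = finemap_output_dir(12, "my-workflow-id")
print(example_dir)
```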

Output file formats

The output file formats of the finemapping pipeline are described on the Finemapping results format page.
