# How to run the colocalization pipeline

## Introduction

This pipeline takes the outputs of our finemapping pipeline and performs colocalization against 571 resources we have gathered, including all GWAS endpoints from FinnGen, UKB, the eQTL Catalogue, the GeneRISK project, and proteomics studies from INTERVAL, UKB and FinnGen.

| Data Source                         | Data type         | Description                                                                                                                                                                           |
| ----------------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| FinnGen-R12                         | GWAS              | all endpoints from FinnGen R12                                                                                                                                                        |
| GeneRisk                            | GWAS              | GeneRISK Study is an ongoing prospective observational study focusing on genetic risk factors of cardiovascular diseases and on utilizing genetic information in preventing diseases. |
| UKB-finucane                        | GWAS              | Some UKB endpoints shared by Masahiro. <https://www.medrxiv.org/content/10.1101/2021.09.03.21262975v1>                                                                                |
| Alasoo\_2018--macrophage\_naive--ge | eQTL\_Catalogue   | Expression QTLs from the eQTL Catalogue (release 6), from macrophages and based on gene expression; see the eQTL Catalogue website for more information                              |
| ... (\~560 more items)              | eQTL\_Catalogue   | Other resources from the eQTL Catalogue, as indicated by the data source name. The eQTL Catalogue assembles multiple data sources, e.g., tissue expression from GTEx.                 |
| INTERVAL                            | Plasma-Proteomics | Proteomics QTL from INTERVAL                                                                                                                                                          |
| UKB-PPP                             | Plasma-Proteomics | Proteomics QTL from UKBiobank (Olink)                                                                                                                                                 |
| FIN-R12-Olink                       | Plasma-Proteomics | Proteomics QTL from FinnGen R12 (Olink)                                                                                                                                               |
| FIN-R12-Somascan                    | Plasma-Proteomics | Proteomics QTL from FinnGen R12 (Somascan)                                                                                                                                            |

## Example run

1. Download the metadata from the finemapping pipeline.

Go to Menu (Applications) -> Sandbox -> pipelines, find your successful finemapping run, and click download metadata. The file is assumed to be saved as Downloads/XXXX\_metadata.json.

<figure><img src="https://3072695768-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MhYL0UTLjqsuIdK0SSO%2Fuploads%2FXucp1IUmBJ8zW7LvdCIJ%2F1.%20download.png?alt=media&#x26;token=ccd5bafd-7c12-4eb6-8cb8-f4f790f0fcca" alt=""><figcaption></figcaption></figure>

2. Submit the colocalization job from a terminal in the **same sandbox that the finemapping pipeline was run in**:

```bash
# Arguments: metadata file, trait_name, data_type, storage path (in your red bucket).
# Customize these for your own project; data_type can be any string without spaces.
# Replace "N" in the red bucket name with your sandbox number; run "gsutil ls" in the sandbox terminal to see the red bucket URI.
bash /finngen/library-green/scripts/coloc/submit ~/Downloads/XXXX_metadata.json T2D GWAS gs://fg-production-sandbox-"N"-red/YOUR_PATH/T2D_Project
```

Check the terminal output for errors.

If no error occurs, pressing the Enter key in the terminal opens a browser where you can check the jobs. Refresh the page and look for your submitted job. The job is named "ColocSusieDirectMulti" and carries your user name. It can take some time to appear because of the response time of the sandbox backends.
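Before submitting, it can help to catch simple input mistakes locally. The function below is a minimal sketch of such a pre-flight check. `check_coloc_args` is a hypothetical helper, not part of the FinnGen scripts, and the rules it enforces (the metadata file exists, data_type contains no spaces, the storage path is a `gs://` URI in a red bucket) are only inferred from the comments in the submit command above.

```bash
# Hypothetical helper (not part of the FinnGen scripts): sanity-check the four
# arguments before passing them to the submit script.
check_coloc_args() {
  local metadata="$1" trait="$2" data_type="$3" storage="$4"

  [ -f "$metadata" ] || { echo "metadata file not found: $metadata" >&2; return 1; }
  [ -n "$trait" ] || { echo "trait_name must not be empty" >&2; return 1; }

  # data_type can be any string, but must not contain whitespace.
  case "$data_type" in
    "" | *[[:space:]]*) echo "data_type must be a non-empty string without spaces" >&2; return 1 ;;
  esac

  # Assumption: the storage path is a gs:// URI inside a red bucket.
  case "$storage" in
    gs://*-red/*) ;;
    *) echo "storage path should look like gs://...-red/YOUR_PATH/..." >&2; return 1 ;;
  esac
}

# Example: run the check first, and submit only if it passes.
# check_coloc_args ~/Downloads/XXXX_metadata.json T2D GWAS gs://fg-production-sandbox-"N"-red/YOUR_PATH/T2D_Project \
#   && bash /finngen/library-green/scripts/coloc/submit ~/Downloads/XXXX_metadata.json T2D GWAS gs://fg-production-sandbox-"N"-red/YOUR_PATH/T2D_Project
```

The check only looks at the shape of the arguments; it does not verify that the bucket exists or that the metadata belongs to a successful run.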

3. Download results

The outputs are labeled "ColocSusieDirectMulti.colocQC" in the outputs section of the pipeline's job details. We keep only results with H4.PP > 0.5 and a valid credible set in both datasets (the threshold can be controlled in the inputs). Further filtering of this output should be based on your purpose, e.g., H4.PP > 0.8 and overlapped region size. We cannot provide a gold standard for this, as it depends on the study design and the aim of the colocalization.
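As one illustration of such downstream filtering, the sketch below keeps only rows with PP.H4.abf above 0.8. It assumes the colocQC file is tab-separated with a header row that uses the column names listed under "Output formats"; the sample rows and the `filter_by_h4` helper are made up for illustration.

```bash
# Tiny stand-in for a colocQC file (made-up rows; assumed tab-separated with a header).
qc_file="$(mktemp)"
printf 'trait1\ttrait2\tPP.H4.abf\ttopInOverlap\n' > "$qc_file"
printf 'T2D\tGENE_A\t0.95\t1,1\n' >> "$qc_file"
printf 'T2D\tGENE_B\t0.62\t1,0\n' >> "$qc_file"
printf 'T2D\tGENE_C\t0.81\t0,0\n' >> "$qc_file"

# Hypothetical helper: print the header plus every row whose PP.H4.abf exceeds
# the given threshold. The column is located by its header name, not by position.
filter_by_h4() {
  awk -F'\t' -v min_pp="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("PP.H4.abf" in col)) { print "PP.H4.abf column not found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    ($(col["PP.H4.abf"]) + 0) > (min_pp + 0)
  ' "$1"
}

filtered="$(filter_by_h4 "$qc_file" 0.8)"
printf '%s\n' "$filtered"
rm -f "$qc_file"
```

Looking the column up by name keeps the filter working if the column order differs from the table; the threshold itself remains a study-specific choice.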

The raw results, without any filtering or merging, are in "ColocSusieDirectMulti.coloc".

"ColocSusieDirectMulti.hit": all the information for the top signals in the full colocalization results.

"ColocSusieDirectMulti.pairs": the overlapping regions that were run in the workflow.

## Output formats

| Column       | Description                                                                                                                                                                                                                                                                                                              |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| dataset1     | generated from your trait\_name and data\_type                                                                                                                                                                                                                                                                           |
| dataset2     | Study--DataType in our resources                                                                                                                                                                                                                                                                                         |
| trait1       | the trait name in your data                                                                                                                                                                                                                                                                                              |
| trait2       | trait name / molecular phenotype name from our resources                                                                                                                                                                                                                                                                 |
| region1      | region in your data                                                                                                                                                                                                                                                                                                      |
| region2      | overlapped region in our resources                                                                                                                                                                                                                                                                                       |
| cs1          | credible set in your data                                                                                                                                                                                                                                                                                                |
| cs2          | credible set in our resources                                                                                                                                                                                                                                                                                            |
| nsnps        | total variants overlapped                                                                                                                                                                                                                                                                                                |
| hit1         | top signal in your data                                                                                                                                                                                                                                                                                                  |
| hit2         | top signal in our resources                                                                                                                                                                                                                                                                                              |
| PP.H4.abf    | probability of colocalization between your data and our resources                                                                                                                                                                                                                                                        |
| low\_purity1 | whether the credible set in your data has low purity (1: low purity, 0: high purity)                                                                                                                                                                                                                                     |
| low\_purity2 | whether the credible set in our resources has low purity (1: low purity, 0: high purity)                                                                                                                                                                                                                                 |
| nsnps1       | number of variants in region from your data                                                                                                                                                                                                                                                                              |
| nsnps2       | number of variants in region from our resources                                                                                                                                                                                                                                                                          |
| cs1\_log10bf | log10 Bayes factor for the credible set in your data                                                                                                                                                                                                                                                                     |
| cs2\_log10bf | log10 Bayes factor for the credible set in our resources                                                                                                                                                                                                                                                                 |
| clpp         | colocalization based on CLPP                                                                                                                                                                                                                                                                                             |
| clpa         | colocalization based on CLPA (min of PIP)                                                                                                                                                                                                                                                                                |
| cs1\_size    | size of the raw credible set in your data                                                                                                                                                                                                                                                                                |
| cs2\_size    | size of the raw credible set in our resources                                                                                                                                                                                                                                                                            |
| cs\_overlap  | size of the overlapped credible set                                                                                                                                                                                                                                                                                      |
| topInOverlap | Indicates whether the top variant (highest PIP) of each dataset lies in the overlap of the finemapped regions of the two datasets. 1,1: both original top signals are in the overlapped region (expected for a reasonable coloc); 1,0 or 0,1: only one top signal is in the overlapped region; 0,0: neither top signal is in the overlapped region. |
| hit1\_info   | information of top signal in your data (beta, p-value)                                                                                                                                                                                                                                                                   |
| hit2\_info   | information of top signal in our resources (beta, p-value)                                                                                                                                                                                                                                                               |
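To get a quick view of how the `topInOverlap` categories are distributed in a result file, a tally like the following can be used. This is a sketch under the same assumptions as before: a tab-separated file with a header row that includes a `topInOverlap` column. The sample rows and the `count_top_in_overlap` helper are hypothetical.

```bash
# Made-up sample rows; only the columns needed for the tally are included.
res_file="$(mktemp)"
printf 'trait2\tPP.H4.abf\ttopInOverlap\n' > "$res_file"
printf 'GENE_A\t0.95\t1,1\n' >> "$res_file"
printf 'GENE_B\t0.91\t1,1\n' >> "$res_file"
printf 'GENE_C\t0.62\t1,0\n' >> "$res_file"
printf 'GENE_D\t0.81\t0,0\n' >> "$res_file"

# Hypothetical helper: count rows per topInOverlap value ("1,1", "1,0", "0,1", "0,0").
count_top_in_overlap() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "topInOverlap") c = i; next }
    c { n[$c]++ }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" | sort
}

counts="$(count_top_in_overlap "$res_file")"
printf '%s\n' "$counts"
rm -f "$res_file"
```

A large share of rows outside the 1,1 category may be a reason to inspect those pairs more closely before interpreting them.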

The code is available on GitHub: <https://github.com/FINNGEN/coloc.susie.direct>

### Running colocalization against your own phenotype(s)

If you want to run colocalization between two of your own phenotypes or groups of phenotypes instead of the predetermined resources, follow these step-by-step instructions.

* Run a separate finemapping run for each of your two sets of phenotypes (Info1 and Info2)
* Download the metadata files of both successful finemapping runs
* Create the regions and other coloc files for both sets by running the following scripts (set the output folder path under /finngen/red, so that the pipelines can access the files there):

`bash /finngen/library-green/scripts/coloc/grabRegionFinemap.sh /path/to/metadata1.json NAME_OF_SET1 GWAS gs://fg-production-sandbox-"N"-red/YOUR_PATH/FOLDER1` and

`bash /finngen/library-green/scripts/coloc/grabRegionFinemap.sh /path/to/metadata2.json NAME_OF_SET2 GWAS gs://fg-production-sandbox-"N"-red/YOUR_PATH/FOLDER2`

You should now have NAME\_OF\_SET1\_GWAS.txt and NAME\_OF\_SET2\_GWAS.txt in the respective folders, accompanied by the Coloc.regions.tsv and Coloc.map.txt files.

* Create an input JSON for coloc (example template: /finngen/library-green/scripts/coloc/wdl/tests.json) and set the "ColocSusieDirectMulti.colocInfo1" and "ColocSusieDirectMulti.colocInfo2" parameters to the paths of NAME\_OF\_SET1\_GWAS.txt and NAME\_OF\_SET2\_GWAS.txt, respectively
* Submit the coloc pipeline using the WDL (gs\://finngen-production-library-green/scripts/coloc/wdl/colocSusieDirectMulti.wdl), the sub-WDL (gs\://finngen-production-library-green/scripts/coloc/wdl/colocSusiePair.zip) and your edited JSON with the finngen-cli command:

`finngen-cli rw /finngen/library-green/scripts/coloc/wdl/colocSusieDirectMulti.wdl -d /finngen/library-green/scripts/coloc/wdl/colocSusiePair.zip -i /path/to/edited/tests.json`
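The JSON editing step above can also be scripted. The sketch below is only an illustration: it works on a stand-in file rather than the real tests.json (whose full contents are not shown here), the `set_json_string` helper is hypothetical, and it assumes each `"key": "value"` pair sits on a single line. Whether the pipeline expects `gs://` URIs for these two parameters is also an assumption; use whatever path form the template uses.

```bash
# Stand-in for a copy of the template; the real tests.json may contain more keys.
inputs_json="$(mktemp)"
cat > "$inputs_json" <<'EOF'
{
  "ColocSusieDirectMulti.colocInfo1": "PLACEHOLDER_1",
  "ColocSusieDirectMulti.colocInfo2": "PLACEHOLDER_2"
}
EOF

# Hypothetical helper: replace the string value of one key in place.
# Assumes the pair is on one line and the new value contains no |, & or backslash.
set_json_string() {
  local file="$1" key="$2" value="$3"
  sed -i.bak "s|\(\"$key\"[[:space:]]*:[[:space:]]*\)\"[^\"]*\"|\1\"$value\"|" "$file"
  rm -f "$file.bak"
}

# Placeholder path: substitute your own red bucket number and folders.
red_path="gs://fg-production-sandbox-N-red/YOUR_PATH"
set_json_string "$inputs_json" "ColocSusieDirectMulti.colocInfo1" "$red_path/FOLDER1/NAME_OF_SET1_GWAS.txt"
set_json_string "$inputs_json" "ColocSusieDirectMulti.colocInfo2" "$red_path/FOLDER2/NAME_OF_SET2_GWAS.txt"

edited="$(cat "$inputs_json")"
printf '%s\n' "$edited"
rm -f "$inputs_json"
```

For anything beyond simple string values, editing the copied template by hand or with a JSON-aware tool is the safer route.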
