# How to run the LDSC unmodifiable pipeline

## Introduction

The LDSC pipeline calculates [heritabilities](/background-reading/heritability-and-genetic-correlations.md#heritability) and [genetic correlations](/background-reading/heritability-and-genetic-correlations.md#genetic-correlation) for disease endpoints using [ldsc](https://github.com/bulik/ldsc). The complete documentation for the pipeline can be found on [GitHub](https://github.com/FINNGEN/LDSC).

You can find the pipeline as `UnmodifiableGeneticCorrelationLDSCDF[RELEASE]` on the unmodifiable pipelines page in Sandbox.

## How it works

Summary:

* Start with **metadata tables** describing your GWAS summary statistics.
* The pipeline splits these into **chunks** for parallel processing.
* Each chunk runs a premunging step (preprocessing/mapping), then `munge_ldsc` (munging + heritability calculation).
* Results are gathered with `gather_h2` (heritabilities).
* If `only_het` is false, **pairwise combinations** are built with `return_couples` and correlations are computed in parallel using `multi_rg`, with summary outputs collected at the end.

<figure><img src="/files/rLjIVgQnPFlgjdHEWlyC" alt=""><figcaption></figcaption></figure>

## Inputs

Conceptually, the pipeline works in two modes: it calculates correlations either within a single list or between two lists. Two input lists must always be provided, via `meta_fg` and `meta_other`.

* If the same list of `N` sumstats is passed twice, the pipeline calculates `N*(N-1)/2` correlations within that list.
* If two different lists of `N` and `M` sumstats are passed, `N*M` correlations are calculated across the two lists.
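
The two counts above can be checked with a short sketch. This is only an illustration of the arithmetic, not the pipeline's `return_couples` implementation; the function name `correlation_pairs` and the phenotype labels are hypothetical, and partially overlapping lists are not covered.

```python
from itertools import combinations, product


def correlation_pairs(fg_phenos, other_phenos):
    """Return the trait pairs implied by the two input lists (illustrative only)."""
    if list(fg_phenos) == list(other_phenos):
        # Same list passed twice: each unordered pair once, no self-pairs.
        return list(combinations(fg_phenos, 2))
    # Two different lists: every trait in the first against every trait in the second.
    return list(product(fg_phenos, other_phenos))


same = ["A", "B", "C", "D"]          # N = 4
other = ["X", "Y", "Z"]              # M = 3
print(len(correlation_pairs(same, same)))   # N*(N-1)/2 = 6
print(len(correlation_pairs(same, other)))  # N*M = 12
```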

The advantage of this setup is that the internal munging required by LDSC is done only once, which improves call caching and speeds up the run.

These are the required inputs for the pipeline:

| Parameter                    | Description                                                                                                                                                                                       |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ldsc_rg.only_het`           | If true, computes only heritabilities (default: false, will also compute genetic correlations).                                                                                                   |
| `ldsc_rg.meta_fg`            | Path to the primary metadata table (TSV) with summary statistics info.                                                                                                                            |
| `ldsc_rg.meta_other`         | Path to the comparison/secondary metadata table (for cross-trait analyses). To calculate all correlations within a single list, this file should be the same as `ldsc_rg.meta_fg`. |
| `ldsc_rg.name`               | Prefix for output files.                                                                                                                                                                          |
| `ldsc_rg.population`         | LD reference population key: "fin" or "eur".                                                                                                                                                      |
| `ldsc_rg.couples_chunk_size` | Number of correlations calculated within each shard. Can be kept low (\~10) for small runs, but if the number of comparisons is huge (e.g. all FG endpoints) it can be in the hundreds. |
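
As a minimal sketch, the required inputs can be assembled and written out as JSON from Python. The file name `FG_LIST.txt` and the run name `my_run` are placeholders, and `estimated_shards` is a hypothetical helper that assumes shards are simply filled up to `couples_chunk_size`; the pipeline's actual sharding may differ.

```python
import json
import math

# Placeholder values: adjust paths and names to your own run.
inputs = {
    "ldsc_rg.only_het": False,
    "ldsc_rg.meta_fg": "FG_LIST.txt",
    "ldsc_rg.meta_other": "FG_LIST.txt",  # same file twice: within-list correlations
    "ldsc_rg.name": "my_run",
    "ldsc_rg.population": "fin",
    "ldsc_rg.couples_chunk_size": 10,
}


def estimated_shards(n_pairs, chunk_size):
    """Rough number of parallel shards, assuming chunk_size pairs per shard."""
    return math.ceil(n_pairs / chunk_size)


print(json.dumps(inputs, indent=2))
# 10 sumstats in one list give 10*9/2 = 45 pairs, i.e. 5 shards of up to 10 pairs.
print(estimated_shards(45, inputs["ldsc_rg.couples_chunk_size"]))
```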

The pipeline now supports premunging, so (almost) any input sumstats can be provided and the pipeline will produce the correct input for LDSC, as long as the basic column information is provided in the JSON. These are the parameters involved:

<table><thead><tr><th width="373">Parameter</th><th>Description</th></tr></thead><tbody><tr><td><code>ldsc_rg.premunge_fg.p_col</code></td><td>Column name for p-values in sumstats input.</td></tr><tr><td><code>ldsc_rg.premunge_fg.a1_effect_col</code></td><td>Column name for effect (alt) allele.</td></tr><tr><td><code>ldsc_rg.premunge_fg.a2_ne_col</code></td><td>Column name for non-effect (ref) allele.</td></tr><tr><td><code>ldsc_rg.premunge_fg.beta_col</code></td><td>Column name for effect size (beta).</td></tr><tr><td><code>ldsc_rg.premunge_fg.rsid_col</code></td><td>Column name for rsID (variant IDs). Leave blank if missing!</td></tr><tr><td><code>ldsc_rg.premunge_fg.chrom_col</code></td><td>Column name for chromosome (if needed for variant parsing).</td></tr><tr><td><code>ldsc_rg.premunge_fg.pos_col</code></td><td>Column name for variant position (if needed for variant parsing).</td></tr></tbody></table>

N.B. The pipeline will also map `CHROM_POS` to `rsid` if necessary, using a build 38 mapping.

If `rsid_col` is passed, the contents of `chrom_col` and `pos_col` are ignored, whatever they are; they can also be left blank. To use chrompos notation as input instead, leave `rsid_col` blank.

For example, here `#chrom` and `pos` are used to build rsIDs:

```
"ldsc_rg.premunge_fg.rsid_col": "",
"ldsc_rg.premunge_fg.chrom_col": "#chrom",
"ldsc_rg.premunge_fg.pos_col": "pos"
```
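
The precedence between `rsid_col` and `chrom_col`/`pos_col` described above can be summarised in a small sketch. The function `variant_id_source` is hypothetical and not part of the pipeline; in particular, raising an error when all three are blank is an assumption, since the page does not say how the pipeline behaves in that case.

```python
def variant_id_source(rsid_col, chrom_col="", pos_col=""):
    """Describe which columns identify variants, following the rules above."""
    if rsid_col:
        # A non-empty rsid_col takes precedence; chrom_col and pos_col are ignored.
        return ("rsid", [rsid_col])
    if chrom_col and pos_col:
        # Blank rsid_col: rsIDs are built from chromosome and position.
        return ("chrom_pos", [chrom_col, pos_col])
    # Assumption: with nothing to identify variants, treat the config as invalid.
    raise ValueError("Provide rsid_col, or both chrom_col and pos_col")


print(variant_id_source("", "#chrom", "pos"))       # ('chrom_pos', ['#chrom', 'pos'])
print(variant_id_source("rsids", "#chrom", "pos"))  # ('rsid', ['rsids'])
```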

If `meta_other != meta_fg`, one has to set the corresponding `premunge_other` columns as well:

```
"ldsc_rg.meta_other": "OTHER_LIST.txt",
"ldsc_rg.premunge_other.a1_effect_col": "alt",
"ldsc_rg.premunge_other.a2_ne_col": "ref",
"ldsc_rg.premunge_other.beta_col": "beta",
....
```

### Input sumstats

The input metadata files listing the sumstats must be tab-separated (TSV) with 3 columns (`phenocode`, `path_to_sumstats`, `N_total`). For example:

```
AD_AM_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz    11345
KRA_PSY_ANXIETY_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz    263812
```
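
Before submitting a run, it can help to check that a metadata file has the expected shape. The following is a minimal sketch, assuming the file has no header row (as in the example above) and that `N_total` is a whole number; `check_metadata` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io


def check_metadata(handle):
    """Return (phenocode, path_to_sumstats, N_total) rows, raising on malformed lines."""
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated columns, got {len(fields)}")
        phenocode, path, n_total = fields
        if not n_total.isdigit():
            raise ValueError(f"line {line_no}: N_total must be a whole number, got {n_total!r}")
        rows.append((phenocode, path, int(n_total)))
    return rows


example = (
    "AD_AM_EXMORE\tgs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz\t11345\n"
    "KRA_PSY_ANXIETY_EXMORE\tgs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz\t263812\n"
)
rows = check_metadata(io.StringIO(example))
print([row[0] for row in rows])
```

For a real file, pass an open file handle instead of the in-memory example, e.g. `check_metadata(open("FG_LIST.txt", newline=""))`, where `FG_LIST.txt` is a placeholder name.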

## Outputs

The pipeline produces the following outputs:

* `ldsc_rg.herit_tsv` --> TSV file with heritabilities for all sumstats
* `ldsc_rg.herit_log` --> log file for heritabilities
* `ldsc_rg.corr_summary` --> TSV with genetic correlations
* `ldsc_rg.corr_log` --> log file for correlations
* `ldsc_rg.munged_ss` --> the munged summary statistics in LDSC format

If the pipeline is run to compute only heritabilities, the `corr_summary` and `corr_log` files are still output, but they will be duplicates of the heritabilities file.

