# How to run the LDSC unmodifiable pipeline

## Introduction

The LDSC pipeline calculates [heritabilities](/background-reading/heritability-and-genetic-correlations.md#heritability) and [genetic correlations](/background-reading/heritability-and-genetic-correlations.md#genetic-correlation) for disease endpoints using [ldsc](https://github.com/bulik/ldsc). The complete documentation for the pipeline is available on [GitHub](https://github.com/FINNGEN/LDSC).

You can find the pipeline as `UnmodifiableGeneticCorrelationLDSCDF[RELEASE]` on the unmodifiable pipelines page in Sandbox.

## How it works

Summary:

* Start with **metadata tables** describing your GWAS summary statistics.
* The pipeline splits these into **chunks** for parallel processing.
* Each chunk runs `premunge_ss` (preprocessing/mapping), then `munge_ldsc` (munging + heritability calculation).
* Results are gathered with `gather_h2` (heritabilities).
* If `only_het` is false, **pairwise combinations** are built with `return_couples` and correlations are computed in parallel using `multi_rg`, with summary outputs collected at the end.
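The pairing and batching step can be pictured with a short Python sketch. This is an illustration only, not the pipeline's code: the helper names `build_pairs` and `split_into_batches` are hypothetical, and the round-robin split is an assumption about how pairs might be distributed over batches.

```python
from itertools import combinations


def build_pairs(phenocodes):
    """Enumerate each unordered pair of phenocodes once (hypothetical helper)."""
    return list(combinations(phenocodes, 2))


def split_into_batches(pairs, chunks):
    """Distribute pairs over `chunks` batches round-robin (assumed strategy)."""
    return [pairs[i::chunks] for i in range(chunks)]


pairs = build_pairs(["A", "B", "C", "D"])
batches = split_into_batches(pairs, chunks=2)

# Each batch could then be handed to one parallel correlation job.
for i, batch in enumerate(batches):
    print(i, batch)
```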

<figure><img src="/files/uc1xM5vcZNiBPzPwQF7Z" alt=""><figcaption></figcaption></figure>

## Inputs

Here are all the required inputs for the general pipeline.

The pipeline works in two modes conceptually, either by calculating cross correlations within a single list or between two lists.

* If the same list is passed twice (as `input_ss` and `comparison_ss`), the pipeline calculates `N*(N-1)/2` correlations.
* If two different lists are passed, `N*M` correlations are calculated.

The advantage of this setup is that call caching improves, because the internal munging required by LDSC is done only once, which speeds up operations for everyone. You can use the number of correlations to reverse engineer the `chunks` parameter. Ideally, aim to keep the number of jobs per shard in the low hundreds.
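The arithmetic above can be turned into a quick estimate of a suitable `chunks` value. The following Python sketch uses hypothetical helper names, and the target of 200 jobs per shard is an assumed reading of "low hundreds", not a value taken from the pipeline.

```python
import math


def n_correlations(n_input, n_comparison, same_list):
    """Number of genetic correlations for the two modes described above."""
    if same_list:
        return n_input * (n_input - 1) // 2
    return n_input * n_comparison


def suggest_chunks(n_corr, jobs_per_shard=200):
    """Smallest chunk count that keeps each shard at or below the target."""
    return max(1, math.ceil(n_corr / jobs_per_shard))


within = n_correlations(100, 100, same_list=True)   # 100*99/2 = 4950
across = n_correlations(100, 30, same_list=False)   # 100*30 = 3000

print(within, suggest_chunks(within))  # 4950 25
print(across, suggest_chunks(across))  # 3000 15
```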

| Parameter                       | Description                                                                                                                                                             |
| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ldsc_rg.only_het`              | If true, computes only heritabilities (default: false, will also compute genetic correlations).                                                                         |
| `ldsc_rg.input_ss`              | Path to the primary metadata table (TSV) with summary statistics info.                                                                                                  |
| `ldsc_rg.comparison_ss`         | Path to the comparison/secondary metadata table (for cross-trait analyses). To compute all correlations within one list, this file should match `input_ss`.             |
| `ldsc_rg.name`                  | Prefix for output files.                                                                                                                                                |
| `ldsc_rg.population`            | LD reference population key. "fin" or "eur"                                                                                                                             |
| `ldsc_rg.return_couples.chunks` | Number of parallel batches for genetic correlation computations.                                                                                                        |
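A minimal Python sketch of assembling the general inputs into a JSON document. All values (bucket path, output prefix, chunk count) are placeholders, and the value types (boolean for `only_het`, integer for `chunks`) are assumptions. The (pre)munging parameters are omitted here.

```python
import json

# Placeholder values for illustration only; replace with your own.
metadata_table = "gs://my-bucket/ldsc/metadata.tsv"  # hypothetical path

inputs = {
    "ldsc_rg.only_het": False,
    "ldsc_rg.input_ss": metadata_table,
    # Same file as input_ss: correlations within a single list.
    "ldsc_rg.comparison_ss": metadata_table,
    "ldsc_rg.name": "my_analysis",
    "ldsc_rg.population": "fin",
    "ldsc_rg.return_couples.chunks": 10,
}

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```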

These are the inputs specifically required for (pre)munging:

| Parameter                           | Description                                                       |
| ----------------------------------- | ----------------------------------------------------------------- |
| `ldsc_rg.premunge_ss.p_col`         | Column name for p-values in sumstats input.                       |
| `ldsc_rg.premunge_ss.a1_effect_col` | Column name for effect (alt) allele.                              |
| `ldsc_rg.premunge_ss.a2_ne_col`     | Column name for non-effect (ref) allele.                          |
| `ldsc_rg.premunge_ss.beta_col`      | Column name for effect size (beta).                               |
| `ldsc_rg.premunge_ss.rsid_col`      | Column name for rsID (variant IDs). Leave blank if missing!       |
| `ldsc_rg.premunge_ss.chrom_col`     | Column name for chromosome (if needed for variant parsing).       |
| `ldsc_rg.premunge_ss.pos_col`       | Column name for variant position (if needed for variant parsing). |

If `rsid_col` is set, the contents of `chrom_col` and `pos_col` are ignored regardless of their values. To use chromosome/position notation as input instead, leave `rsid_col` blank.

N.B. No liftover takes place, so make sure the chromosome and position coordinates are on build 38.

For example, in this case `#chrom` and `pos` are used to build rsIDs:

```
"ldsc_rg.premunge_ss.rsid_col": "",
"ldsc_rg.premunge_ss.chrom_col": "#chrom",
"ldsc_rg.premunge_ss.pos_col": "pos"
```
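The column-selection rule described above can be written as a small Python function. The function name is hypothetical, and raising an error when neither option is usable is an assumption of this sketch rather than documented pipeline behaviour.

```python
def variant_id_source(rsid_col, chrom_col, pos_col):
    """Return which columns identify variants, following the rule above."""
    if rsid_col:
        # rsid_col takes precedence; chrom_col and pos_col are ignored.
        return "rsid"
    if chrom_col and pos_col:
        # rsid_col is blank: rsIDs are built from chromosome and position.
        return "chrompos"
    # Assumption: treat a fully blank configuration as an error.
    raise ValueError("set rsid_col, or both chrom_col and pos_col")


print(variant_id_source("", "#chrom", "pos"))      # chrompos
print(variant_id_source("rsid", "#chrom", "pos"))  # rsid
```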

### Input sumstats

The input files listing the sumstats must be tab-separated (TSV) with 3 columns (`phenocode`, `path_to_phenocode`, `N_total`). For example:

```
AD_AM_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz    11345
KRA_PSY_ANXIETY_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz    263812
```
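Before launching a run, it can help to check that the table really has three tab-separated columns and an integer sample size. The Python sketch below is a hypothetical helper, not part of the pipeline, and it assumes the file has no header row, as in the example above.

```python
import csv
import io


def validate_metadata(text):
    """Check each row has phenocode, path and an integer N_total."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        phenocode, path, n_total = row
        rows.append((phenocode, path, int(n_total)))  # int() rejects non-integers
    return rows


example = (
    "AD_AM_EXMORE\tgs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz\t11345\n"
    "KRA_PSY_ANXIETY_EXMORE\tgs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz\t263812\n"
)
rows = validate_metadata(example)
print(len(rows), rows[0][0], rows[0][2])
```

A file separated by spaces instead of tabs would fail the three-column check, which is a common formatting mistake worth catching early.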

## Outputs

The pipeline produces the following outputs:

* `ldsc_rg.herit_tsv`: TSV file with heritabilities for all sumstats
* `ldsc_rg.herit_log`: log file for heritabilities
* `ldsc_rg.corr_summary`: TSV file with genetic correlations
* `ldsc_rg.corr_log`: log file for correlations

If the pipeline is run to compute only heritabilities, the `corr_summary` and `corr_log` files are still output, but they are duplicates of the heritabilities file.


