How to run the LDSC unmodifiable pipeline

Introduction

The LDSC pipeline calculates heritabilities and genetic correlations for disease endpoints using ldsc. The complete documentation for the pipeline can be found on GitHub.

You can find the pipeline as "UnmodifiableGeneticCorrelationLDSCDF[RELEASE]" on the unmod pipeline page in the sandbox.

How it works

Summary:

  • Start with metadata tables describing your GWAS summary statistics.

  • The pipeline splits these into chunks for parallel processing.

  • Each chunk runs premunge_ss (preprocessing/mapping), then munge_ldsc (munging + heritability calculation).

  • Results are gathered with gather_h2 (heritabilities).

  • If only_het is false, pairwise combinations are built with return_couples and correlations are computed in parallel using multi_rg, with summary outputs collected at the end.

Inputs

Here are all the required inputs for the general pipeline.

Conceptually, the pipeline works in two modes: it calculates correlations either within a single list or between two lists.

  • If the same list is passed twice (as input_ss and comparison_ss), the pipeline calculates N*(N-1)/2 correlations.

  • If two different lists are passed, N*M correlations are calculated.

The advantage of this setup is that call caching is improved, because the internal munging required by LDSC is done only once, speeding up operations for everyone. One can use the number of correlations to run to reverse engineer the chunks parameter (ldsc_rg.return_couples.chunks). Ideally, one should aim to keep the number of jobs per shard in the low hundreds.
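The counting rules and the chunk sizing advice can be sketched as follows. This is only an illustration: the helper names and the target of 200 jobs per shard are assumptions, and it assumes the correlations are spread evenly across the chunks.

```python
import math
from typing import Optional


def n_correlations(n: int, m: Optional[int] = None) -> int:
    """Number of genetic correlations to compute.

    n: number of sumstats in input_ss.
    m: number of sumstats in comparison_ss, or None when the same
       list is passed twice.
    """
    if m is None:
        # Same list twice: every unordered pair once, N*(N-1)/2.
        return n * (n - 1) // 2
    # Two different lists: every cross pair, N*M.
    return n * m


def suggest_chunks(n_pairs: int, target_per_chunk: int = 200) -> int:
    """Smallest chunk count that keeps each shard at or below the target.

    target_per_chunk is a hypothetical value for "low hundreds".
    """
    return max(1, math.ceil(n_pairs / target_per_chunk))


pairs = n_correlations(100)           # 100 sumstats against themselves
print(pairs, suggest_chunks(pairs))   # 4950 25
```

With two different lists of 10 and 20 sumstats, n_correlations(10, 20) gives 200, so a single chunk would already sit at the assumed target.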

| Parameter | Description |
| --- | --- |
| ldsc_rg.only_het | If true, computes only heritabilities (default: false, will also compute genetic correlations). |
| ldsc_rg.input_ss | Path to the primary metadata table (TSV) with summary statistics info. |
| ldsc_rg.comparison_ss | Path to the comparison/secondary metadata table (for cross-trait analyses). To calculate all correlations within a single list, this file should match input_ss. |
| ldsc_rg.name | Prefix for output files. |
| ldsc_rg.population | LD reference population key: "fin" or "eur". |
| ldsc_rg.return_couples.chunks | Number of parallel batches for genetic correlation computations. |
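As a rough illustration, the parameter names listed above could be collected into a JSON object like the one built below. This assumes the pipeline accepts its inputs as JSON keyed by these names, which may not match how the sandbox collects them. Every value (paths, prefix, chunk count) is a hypothetical placeholder.

```python
import json

# Keys come from the parameter list; all values are hypothetical placeholders.
inputs = {
    "ldsc_rg.only_het": False,
    "ldsc_rg.input_ss": "/path/to/input_meta.tsv",
    # Same file as input_ss: correlations within a single list.
    "ldsc_rg.comparison_ss": "/path/to/input_meta.tsv",
    "ldsc_rg.name": "my_run",
    "ldsc_rg.population": "fin",  # "fin" or "eur"
    "ldsc_rg.return_couples.chunks": 25,
}

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```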

These are the inputs specifically required for (pre)munging:

| Parameter | Description |
| --- | --- |
| ldsc_rg.premunge_ss.p_col | Column name for p-values in sumstats input. |
| ldsc_rg.premunge_ss.a1_effect_col | Column name for effect (alt) allele. |
| ldsc_rg.premunge_ss.a2_ne_col | Column name for non-effect (ref) allele. |
| ldsc_rg.premunge_ss.beta_col | Column name for effect size (beta). |
| ldsc_rg.premunge_ss.rsid_col | Column name for rsID (variant IDs). Leave blank if missing! |
| ldsc_rg.premunge_ss.chrom_col | Column name for chromosome (if needed for variant parsing). |
| ldsc_rg.premunge_ss.pos_col | Column name for variant position (if needed for variant parsing). |

If rsid_col is passed, chrom_col and pos_col are ignored regardless of their content, so they can be left blank. To use chrom/pos notation as input instead, leave rsid_col blank.

N.B. No liftover takes place, so make sure the chr/pos coordinates belong to build 38. E.g. with chrom_col set to #chrom and pos_col set to pos (and rsid_col left blank), #chrom and pos are used to build rsids.
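The precedence between the three columns can be summarised in a small sketch. The function name is hypothetical, and raising an error when neither option is available is an assumption here, not documented pipeline behaviour.

```python
from typing import Optional


def variant_id_source(rsid_col: Optional[str],
                      chrom_col: Optional[str],
                      pos_col: Optional[str]) -> str:
    """Return which columns identify variants: 'rsid' or 'chrompos'."""
    if rsid_col:
        # rsid_col takes precedence; chrom_col and pos_col are ignored.
        return "rsid"
    if chrom_col and pos_col:
        # rsid_col is blank: chromosome and position (build 38, no
        # liftover) are used to build rsids.
        return "chrompos"
    # Assumption: with nothing usable, fail early instead of guessing.
    raise ValueError("Provide rsid_col, or both chrom_col and pos_col.")


print(variant_id_source("", "#chrom", "pos"))      # chrompos
print(variant_id_source("rsid", "#chrom", "pos"))  # rsid
```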

Input sumstats

The input files listing the sumstats (input_ss and comparison_ss) must be tab-separated (TSV) with 3 columns (phenocode, path_to_phenocode, N_total). For example:
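A minimal sketch of writing such a table with Python's csv module. The phenocodes, paths, and sample sizes are hypothetical placeholders. Whether the pipeline expects a header row is not stated here, so the sketch leaves one out as an assumption.

```python
import csv
import io

# Hypothetical rows: (phenocode, path_to_phenocode, N_total).
rows = [
    ("PHENO_A", "/path/to/PHENO_A.tsv.gz", 100000),
    ("PHENO_B", "/path/to/PHENO_B.tsv.gz", 95000),
]

# Write to an in-memory buffer; swap in open("meta.tsv", "w", newline="")
# to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

metadata_tsv = buffer.getvalue()
print(metadata_tsv, end="")
```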

Outputs

The pipeline produces the following outputs:

  • "ldsc_rg.herit_tsv" --> TSV file with heritabilities for all sumstats

  • "ldsc_rg.herit_log" --> log file for heritabilities

  • "ldsc_rg.corr_summary" --> TSV with genetic correlations

  • "ldsc_rg.corr_log" --> log file for correlations

If the pipeline is run to compute only heritabilities (only_het set to true), the corr_summary and corr_log files are still output, but they are duplicates of the heritabilities file.
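When handling outputs downstream, it can help to detect this case before parsing corr_summary as correlations. The sketch below assumes the duplicate is byte-identical to the heritabilities TSV, which is not confirmed here. The function name is hypothetical.

```python
import filecmp


def corr_is_duplicate(herit_tsv: str, corr_summary: str) -> bool:
    """True when corr_summary has exactly the same bytes as herit_tsv."""
    # shallow=False forces a content comparison, not just a stat() check.
    return filecmp.cmp(herit_tsv, corr_summary, shallow=False)
```

A True result would suggest the run used only_het and that corr_summary holds no genetic correlations.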
