How to run the LDSC unmodifiable pipeline

Introduction

The LDSC pipeline calculates heritabilities and genetic correlations for disease endpoints using LDSC. The complete documentation for the pipeline can be found on GitHub.

How it works

You can find the pipeline as "UnmodifiableGeneticCorrelationLDSCDF[RELEASE]" on the unmod pipeline page in Sandbox.

The pipeline takes the following inputs:

"ldsc_rg.only_het" --> true|false if true it only produces heritabilities for the input sumstats
"ldsc_rg.input_ss" --> table containing path to sumstats and meta data
"ldsc_rg.comparison_ss" --> optional second table analogous to the one above
"ldsc_rg.name" --> prefix for outputs
"ldsc_rg.population" --> fin|eur population to use
"ldsc_rg.return_couples.chunks" --> how many correlation jobs are run per shard. Default is 4. If running very large jobs (hundreds of thousand of correlations), the number can be be increased massively

Conceptually, the pipeline works in two modes: it calculates cross-correlations either within a single list or between two lists.

If the same list of N sumstats is passed twice (as both input_ss and comparison_ss), the pipeline will calculate N*(N-1)/2 correlations.
If two different lists of N and M sumstats are passed, N*M correlations will be calculated.

The advantage of this setup is that call caching is improved, because the internal munging required by LDSC is done only once, which speeds up operations for everyone. You can use the number of correlations to be run to reverse engineer the chunks parameter mentioned above. Ideally, aim to keep the number of jobs per shard in the low hundreds.
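
The arithmetic above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the shard count assumes that each shard runs exactly `chunks` correlation jobs, as the parameter description suggests.

```python
import math
from typing import Optional


def n_correlations(n: int, m: Optional[int] = None) -> int:
    """Correlations to run: N*(N-1)/2 within one list, N*M between two lists."""
    if m is None:
        return n * (n - 1) // 2
    return n * m


def n_shards(total: int, chunks: int) -> int:
    """Shards needed if each shard runs `chunks` correlation jobs (assumed meaning)."""
    return math.ceil(total / chunks)


within = n_correlations(1000)       # one list of 1000 sumstats
between = n_correlations(1000, 50)  # 1000 sumstats against 50 others

# Compare the default chunks value (4) with a value in the low hundreds.
print(within, n_shards(within, 4), n_shards(within, 200))
print(between, n_shards(between, 4), n_shards(between, 200))
```

With numbers like these, raising chunks from the default trades a very large number of small shards for far fewer shards of a couple of hundred jobs each.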

Input sumstats

The input tables listing the sumstats must be tab-separated (TSV) files with 3 columns (phenocode, path_to_phenocode, N_total). For example:

AD_AM_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz    11345
KRA_PSY_ANXIETY_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz    263812
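
Before submitting, it can help to check that a table follows this layout. The function below is a minimal sketch, not part of the pipeline: its name is hypothetical, and the checks (exactly three tab-separated columns, a positive integer N_total) are assumptions derived from the description above.

```python
from typing import List, Tuple


def parse_sumstats_table(text: str) -> List[Tuple[str, str, int]]:
    """Parse a 3-column TSV (phenocode, path, N_total); raise ValueError on bad rows."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 3:
            raise ValueError(
                f"line {lineno}: expected 3 tab-separated columns, got {len(fields)}"
            )
        phenocode, path, n_total = fields
        if not phenocode or not path:
            raise ValueError(f"line {lineno}: empty phenocode or path")
        try:
            n = int(n_total)
        except ValueError:
            raise ValueError(f"line {lineno}: N_total is not an integer: {n_total!r}") from None
        if n <= 0:
            raise ValueError(f"line {lineno}: N_total must be positive, got {n}")
        rows.append((phenocode, path, n))
    return rows


# The two example rows from above, joined with real tab characters.
example = "\n".join([
    "\t".join(["AD_AM_EXMORE",
               "gs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz",
               "11345"]),
    "\t".join(["KRA_PSY_ANXIETY_EXMORE",
               "gs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz",
               "263812"]),
])
table = parse_sumstats_table(example)
print(len(table), table[0][0], table[0][2])
```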

Pre-munge your summary statistics file(s):

Before running the pipeline, make sure that the input sumstats conform to the requirements of LDSC's own munging step.

The required input format is as follows:

SNP	A1	A2	BETA	P
rs74337086	A	G	0.0923	0.5059
rs76388980	A	G	0.1227	0.2945
rs562172865	T	C	-0.0262	0.8142
rs780596509	A	G	-0.2202	0.1545
rs778009914	A	G	-0.3938	0.3044
rs564223368	T	C	0.2195	0.03913
rs71628921	C	A	0.1763	0.3682
rs577189614	A	G	0.0845	0.5341
rs77357188	T	C	-0.0414	0.3383
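
A quick sanity check of a file against this format could look like the sketch below. It is not part of the pipeline: the function name is hypothetical, and the checks (exact header, five tab-separated fields, a finite BETA, and P between 0 and 1) are assumptions based on the example above.

```python
import math
from typing import Iterable, List

REQUIRED_HEADER = ["SNP", "A1", "A2", "BETA", "P"]


def check_premunged(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means the lines look fine."""
    problems = []
    it = iter(lines)
    header = next(it, "").rstrip("\n").split("\t")
    if header != REQUIRED_HEADER:
        return [f"header is {header}, expected {REQUIRED_HEADER}"]
    for lineno, line in enumerate(it, start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            problems.append(f"line {lineno}: expected 5 fields, got {len(fields)}")
            continue
        snp, a1, a2, beta, p = fields
        try:
            beta_val, p_val = float(beta), float(p)
        except ValueError:
            problems.append(f"line {lineno}: BETA or P is not numeric")
            continue
        if not math.isfinite(beta_val):
            problems.append(f"line {lineno}: BETA is not finite")
        if not (0.0 <= p_val <= 1.0):
            problems.append(f"line {lineno}: P is outside [0, 1]")
    return problems


sample = [
    "SNP\tA1\tA2\tBETA\tP",
    "rs74337086\tA\tG\t0.0923\t0.5059",
    "rs76388980\tA\tG\t0.1227\t0.2945",
]
print(check_premunged(sample))
```

Because the function takes any iterable of lines, a gzipped file can be checked by passing it a handle opened with `gzip.open(path, "rt")`.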

To convert summary statistics in REGENIE output format into the right format, you can use the following example:

bash /finngen/library-green/scripts/ldsc/munge_sumstats.sh $SUM_STATS $OUT_FILE

where $SUM_STATS is the path to your input summary statistics file, and $OUT_FILE is the name of your munged summary statistics file.

Outputs:

The pipeline produces the following outputs:

"ldsc_rg.herit_tsv" --> TSV file with heritabilites for all sumstats
"ldsc_rg.herit_log" --> log file for heritabilites
"ldsc_rg.corr_summary" --> TSV with genetic correlations
"ldsc_rg.corr_log" --> log file for correlations

If the pipeline is run to produce only heritabilities, the corr_summary and corr_log files will still be output, but they will be duplicates of the heritabilities file.


