How to run the LDSC unmodifiable pipeline
Introduction
The LDSC pipeline calculates heritabilities and genetic correlations for disease endpoints using ldsc. The complete documentation for the pipeline can be found on GitHub.
How it works
You can find the pipeline as UnmodifiableGeneticCorrelationLDSCDF[RELEASE] on the unmod pipeline page in Sandbox.
The pipeline takes the following inputs:
"ldsc_rg.only_het" --> true|false; if true, the pipeline only produces heritabilities for the input sumstats
"ldsc_rg.input_ss" --> table containing paths to sumstats and metadata
"ldsc_rg.comparison_ss" --> optional second table analogous to the one above
"ldsc_rg.name" --> prefix for outputs
"ldsc_rg.population" --> fin|eur; population to use
"ldsc_rg.return_couples.chunks" --> how many correlation jobs are run per shard. Default is 4. When running very large jobs (hundreds of thousands of correlations), the number can be increased massively
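As an illustration, the inputs above could be collected into a JSON document. This is a minimal sketch that assumes the pipeline accepts its inputs as JSON, which this page does not state. The keys are the ones listed above; every value, including the bucket path and the output prefix, is a hypothetical placeholder.

```python
import json

# Keys come from the input list above; all values are placeholders.
inputs = {
    "ldsc_rg.only_het": False,
    # Hypothetical path to the table described under "Input sumstats".
    "ldsc_rg.input_ss": "gs://my-bucket/ldsc/input_table.tsv",
    # Optional; passing the same table twice requests within-list correlations.
    "ldsc_rg.comparison_ss": "gs://my-bucket/ldsc/input_table.tsv",
    "ldsc_rg.name": "my_analysis",
    "ldsc_rg.population": "fin",
    "ldsc_rg.return_couples.chunks": 4,
}

# Serialize so the values can be pasted or saved as an inputs file.
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```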
The pipeline works in two modes conceptually, either by calculating cross correlations within a single list or between two lists.
If the same list is passed twice (input_ss and comparison_ss), the pipeline will calculate N*(N-1)/2 correlations
If two different lists are passed, then N*M correlations will be calculated
The advantage of this setup is that call caching is improved, as the internal munging required by LDSC is done only once, speeding up operations for everyone. One can use the number of correlations to run to reverse engineer the chunks parameter mentioned above. Ideally one should aim to keep the number of jobs per shard in the low hundreds.
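The arithmetic behind the two modes and the chunks parameter can be sketched as follows. The helper names are hypothetical, and the shard count assumes that chunks is the number of correlation jobs per shard, so that the number of shards is the total divided by chunks, rounded up. That reading is an interpretation of the description above, not something the pipeline documentation confirms here.

```python
import math
from typing import Optional


def n_correlations(n: int, m: Optional[int] = None) -> int:
    """Number of correlations for one list of n sumstats, or two lists of n and m."""
    if m is None:
        # Same list passed twice: every unordered pair once.
        return n * (n - 1) // 2
    # Two different lists: every cross pair.
    return n * m


def n_shards(total: int, chunks: int) -> int:
    """Shards needed if each shard runs `chunks` correlation jobs (assumption)."""
    return math.ceil(total / chunks)


total = n_correlations(100)
print(total, n_shards(total, 4), n_shards(total, 50))
```

Trying a few values of chunks against the expected total gives a quick way to pick a setting before launching a large run.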
Input sumstats
The input tables (input_ss and comparison_ss) must be tab-separated (TSV) files with 3 columns (phenocode, path_to_phenocode, N_total). For example:
AD_AM_EXMORE gs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz 11345
KRA_PSY_ANXIETY_EXMORE gs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz 263812
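A table in this shape can be produced with Python's csv module. This is a sketch only: the function name is hypothetical, the rows reuse the example above, and it assumes the table has no header line, since the example does not show one.

```python
import csv
import io

# (phenocode, path_to_phenocode, N_total), taken from the example above.
rows = [
    ("AD_AM_EXMORE",
     "gs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz",
     11345),
    ("KRA_PSY_ANXIETY_EXMORE",
     "gs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz",
     263812),
]


def write_input_table(rows, handle):
    """Write one tab-separated line per sumstat, without a header (assumption)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for phenocode, path, n_total in rows:
        writer.writerow([phenocode, path, n_total])


# Written to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
write_input_table(rows, buffer)
table_text = buffer.getvalue()
print(table_text, end="")
```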
Pre-munge your summary statistics file(s):
Before running the pipeline, make sure that the input sumstats meet the requirements of ldsc's own munging step.
The required input format is as follows:
SNP A1 A2 BETA P
rs74337086 A G 0.0923 0.5059
rs76388980 A G 0.1227 0.2945
rs562172865 T C -0.0262 0.8142
rs780596509 A G -0.2202 0.1545
rs778009914 A G -0.3938 0.3044
rs564223368 T C 0.2195 0.03913
rs71628921 C A 0.1763 0.3682
rs577189614 A G 0.0845 0.5341
rs77357188 T C -0.0414 0.3383
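Before submitting a run, the header of a pre-munged file can be checked against the five columns shown above. The sketch below uses hypothetical helper names and assumes the columns are whitespace-separated and that files ending in .gz are gzip-compressed; it only inspects the header line and does not validate the values.

```python
import gzip
import io

REQUIRED_COLUMNS = ["SNP", "A1", "A2", "BETA", "P"]


def missing_columns(handle):
    """Return the required columns that are absent from the header line."""
    header = handle.readline().split()
    return [column for column in REQUIRED_COLUMNS if column not in header]


def check_file(path):
    """Open a plain or gzip-compressed sumstats file and check its header."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return missing_columns(handle)


# In-memory example using the first rows of the table above.
sample = io.StringIO("SNP A1 A2 BETA P\nrs74337086 A G 0.0923 0.5059\n")
print(missing_columns(sample))
```

An empty list means all required columns are present; any names returned are the ones to add before running the pipeline.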
To convert summary statistics in REGENIE output format into the right format, you can use the following example:
bash /finngen/library-green/scripts/ldsc/munge_sumstats.sh $SUM_STATS $OUT_FILE
where $SUM_STATS is the path to your input summary statistics file, and $OUT_FILE is the name of your munged summary statistics file.
Outputs:
The pipeline produces the following outputs:
"ldsc_rg.herit_tsv" --> TSV file with heritabilities for all sumstats
"ldsc_rg.herit_log" --> log file for heritabilities
"ldsc_rg.corr_summary" --> TSV with genetic correlations
"ldsc_rg.corr_log" --> log file for correlations
If the pipeline is run to produce only heritabilities, the corr_summary and corr_log files will still be output, but they will be duplicates of the heritabilities file.