How to run the LDSC unmodifiable pipeline
Introduction
The LDSC pipeline calculates heritabilities and genetic correlations for disease endpoints using ldsc. The complete documentation for the pipeline can be found on GitHub.
You can find the pipeline as "UnmodifiableGeneticCorrelationLDSCDF[RELEASE]" on the unmod pipeline page in sandbox.
How it works
Summary:
Start with metadata tables describing your GWAS summary statistics.
The pipeline splits these into chunks for parallel processing.
Each chunk runs premunge_ss (preprocessing/mapping), then munge_ldsc (munging + heritability calculation).
Results are gathered with gather_h2 (heritabilities).
If only_het is false, pairwise combinations are built with return_couples and correlations are computed in parallel using multi_rg, with summary outputs collected at the end.

Inputs
Here are all the required inputs for the general pipeline.
Conceptually, the pipeline works in two modes: it calculates cross correlations either within a single list or between two lists.
If the same list is passed twice (input_ss and comparison_ss), the pipeline will calculate N*(N-1)/2 correlations.
If two different lists are passed, N*M correlations will be calculated.
The advantage of this setup is that call caching is improved: the internal munging required by LDSC is done only once, speeding up operations for everyone. One can use the number of correlations to run to reverse engineer the chunks parameter (ldsc_rg.return_couples.chunks). Ideally one should aim to keep the number of jobs per shard in the low hundreds.
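The arithmetic for picking a chunk count can be sketched as follows. This is a minimal illustration, not part of the pipeline: the function names and the target of 200 jobs per shard are hypothetical, and it assumes the correlations are split roughly evenly across chunks.

```python
import math


def n_correlations(n_input: int, n_comparison: int, same_list: bool) -> int:
    """Number of genetic correlations to compute for the given list sizes."""
    if same_list:
        # All unordered pairs within one list: N*(N-1)/2
        return n_input * (n_input - 1) // 2
    # Every input trait against every comparison trait: N*M
    return n_input * n_comparison


def suggest_chunks(n_corr: int, jobs_per_shard: int = 200) -> int:
    """Smallest chunk count that keeps each shard at or below jobs_per_shard.

    jobs_per_shard=200 is an assumed target in the "low hundreds".
    """
    return max(1, math.ceil(n_corr / jobs_per_shard))


# 100 traits correlated against themselves
total = n_correlations(100, 100, same_list=True)
print(total, suggest_chunks(total))
```

For 100 traits in a single list this gives 4950 correlations, so about 25 chunks would keep each shard near 200 jobs.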
ldsc_rg.only_het
If true, computes only heritabilities (default: false, will also compute genetic correlations).
ldsc_rg.input_ss
Path to the primary metadata table (TSV) with summary statistics info.
ldsc_rg.comparison_ss
Path to the comparison/secondary metadata table (for cross-trait analyses). To calculate all correlations within a single list, this file should be the same as input_ss.
ldsc_rg.name
Prefix for output files.
ldsc_rg.population
LD reference population key: "fin" or "eur".
ldsc_rg.return_couples.chunks
Number of parallel batches for genetic correlation computations.
These are the inputs specifically required for (pre)munging.
ldsc_rg.premunge_ss.p_col
Column name for p-values in sumstats input.
ldsc_rg.premunge_ss.a1_effect_col
Column name for effect (alt) allele.
ldsc_rg.premunge_ss.a2_ne_col
Column name for non-effect (ref) allele.
ldsc_rg.premunge_ss.beta_col
Column name for effect size (beta).
ldsc_rg.premunge_ss.rsid_col
Column name for rsID (variant IDs). Leave blank if missing!
ldsc_rg.premunge_ss.chrom_col
Column name for chromosome (if needed for variant parsing).
ldsc_rg.premunge_ss.pos_col
Column name for variant position (if needed for variant parsing).
If rsid_col is passed, chrom_col and pos_col are ignored regardless of their content, so they can also be left blank. To use chrom/pos notation as input instead, leave rsid_col blank.
N.B. No liftover takes place, so make sure the chr/pos coordinates are in build 38.
E.g. In this case #chrom and pos are used to build rsids
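A minimal sketch of an inputs file for the chrom/pos case, built in Python and dumped as JSON. The parameter keys are the ones listed above; every value (bucket paths, output prefix, column names, chunk count) is a hypothetical placeholder, and representing "blank" as an empty string is an assumption.

```python
import json

# Keys come from the parameter list above; all values are hypothetical.
inputs = {
    "ldsc_rg.only_het": False,
    "ldsc_rg.input_ss": "gs://my-bucket/ldsc/input_meta.tsv",
    # Same table passed twice: correlations within a single list
    "ldsc_rg.comparison_ss": "gs://my-bucket/ldsc/input_meta.tsv",
    "ldsc_rg.name": "my_run",
    "ldsc_rg.population": "fin",
    "ldsc_rg.return_couples.chunks": 25,
    "ldsc_rg.premunge_ss.p_col": "pval",
    "ldsc_rg.premunge_ss.a1_effect_col": "alt",
    "ldsc_rg.premunge_ss.a2_ne_col": "ref",
    "ldsc_rg.premunge_ss.beta_col": "beta",
    # Left blank (assumed to mean an empty string) so chrom/pos are used
    "ldsc_rg.premunge_ss.rsid_col": "",
    "ldsc_rg.premunge_ss.chrom_col": "#chrom",
    "ldsc_rg.premunge_ss.pos_col": "pos",
}

print(json.dumps(inputs, indent=2))
```

With rsid_col empty, the #chrom and pos columns are the ones used to build the variant IDs, so they must already be in build 38.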
Input sumstats
The input metadata files need to be tab-separated (TSV) with 3 columns (phenocode, path_to_phenocode, N_total). For example:
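A minimal sketch of writing and sanity-checking such a table in Python. The phenocodes, paths and sample sizes are made up, the helper names are hypothetical, and the sketch writes no header row, which is an assumption since the expected header handling is not stated here.

```python
import csv
import io

# Hypothetical rows: (phenocode, path_to_phenocode, N_total)
rows = [
    ("PHENO_A", "gs://my-bucket/sumstats/PHENO_A.gz", 350000),
    ("PHENO_B", "gs://my-bucket/sumstats/PHENO_B.gz", 420000),
]


def write_metadata(rows, handle):
    """Write one tab-separated line per phenotype, without a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for phenocode, path, n_total in rows:
        writer.writerow([phenocode, path, int(n_total)])


def check_metadata(text):
    """Return a list of problems found in a metadata table (empty if none)."""
    problems = []
    for i, line in enumerate(text.splitlines(), start=1):
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(f"line {i}: expected 3 columns, found {len(fields)}")
        elif not fields[2].isdigit():
            problems.append(f"line {i}: N_total is not an integer: {fields[2]!r}")
    return problems


buffer = io.StringIO()
write_metadata(rows, buffer)
table = buffer.getvalue()
print(table, end="")
print(check_metadata(table))
```

To write a real file, pass an open file handle to write_metadata instead of the in-memory buffer. Note that check_metadata would flag a header line, because its third field is not an integer.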
Outputs:
The pipeline produces the following outputs:
"ldsc_rg.herit_tsv" --> TSV file with heritabilities for all sumstats
"ldsc_rg.herit_log" --> log file for heritabilities
"ldsc_rg.corr_summary" --> TSV with genetic correlations
"ldsc_rg.corr_log" --> log file for correlations
If the pipeline is run to compute only heritabilities, the corr_summary and corr_log files will still be output, but they will be duplicates of the heritabilities file.