Skip to content

4. Polygenic Localization with PolyLoc

Omer Weissbrod edited this page Apr 13, 2020 · 14 revisions

PolyLoc takes an input file with posterior means and standard deviations of causal effect sizes (estimated by run_finemapper). PolyLoc uses this file to partition SNPs into bins of similar posterior per-SNP heritability, and estimates the heritability causally explained by each bin. PolyLoc consists of three stages:

  1. Partition SNPs into bins of similar posterior per-SNP heritability
  2. Compute LD-scores for the SNP bins
  3. Estimate the heritability casaully explained by each bin. This stage requires summary statistics based on different data than the data used to run PolyFun (see Weissbrod et al. 2019 bioRxiv for details).

PolyLoc and PolyFun have similar input files and they share many command-line arguments. You can see all the PolyLoc options by typing `python polyloc.py --help`. We now describe each of the stages of PolyLoc in detail.



PolyLoc stage 1: Partition SNPs into bins of similar posterior per-SNP heritability

This stage requires a file with posterior causal effect sizes of SNPs (ideally all genome-wide SNPs, or at least all the ones in genome-wide significant loci), as created by run_finemapper.py. Here is an example file (seen via cat example_data/posterior_betas.gz | zcat | head):

CHR  SNP         BP       A1  A2  Z         N       BETA_MEAN  BETA_SD
1    rs3748597   888659   T   C   0.85233   383290  0.00021    0.00005
1    rs6661956   2465912  C   T   0.98513   383290  0.00161    0.00006
1    rs2376821   2974852  T   C   -0.73402  383290  -0.00186   0.00111
1    rs10797386  3168280  A   G   -0.59635  383290  -0.00057   0.00006
1    rs7548372   4943523  C   A   2.60293   383290  0.00643    0.00014
1    rs6678838   5003147  G   T   1.10294   383290  0.00135    0.00008
1    rs6675995   5226752  G   A   -1.22154  383290  -0.00027   0.00034
1    rs2012852   5541205  C   A   0.30166   383290  0.00022    0.00027
1    rs10864271  7135268  A   G   2.03717   383290  0.00471    0.00031

The column BETA_MEAN contains the posterior means of causal effect sizes (as estimated by PolyFun), and the column BETA_SD contains their posterior standard deviation. The other required columns are CHR, SNP, BP, A1, A2.

Here is an example command that uses this file:

mkdir -p output

python polyloc.py \
    --compute-partitions \
    --output-prefix output/polyloc_test \
    --posterior example_data/posterior_betas.gz \
    --bfile-chr example_data/reference. 

The parameters we provided are the following:

  1. --compute-partitions - this tells PolyLoc to partition SNPs into bins of similar posterior per-SNP heritability
  2. --output-prefix - this specifies the prefix of all PolyLoc output file names
  3. --posterior - this specifies the input file with posterior means and standard deviations of causal effect sizes. This file should ideally include information for all SNPs. Every SNP not included in this file will be treated as if its posterior causal effect size is approximately zero, potentially leading to suboptimal polygenic localization if it's an important SNP. As in PolyFun, this file can either be a text file (possibly gzipped) or a parquet file, which allows faster loading.
  4. --bfile-chr - this is the prefix of plink files that PolyLoc will use to assign SNPs not reported in the --posterior file into bins. As in PolyFun, there must be one plink file for each chromosome.

Additional optional parameters are:

  1. --skip-Ckmedian - This tells PolyLoc to partition SNPs into bins using scikits-learn instead of Ckmedian. This is a suboptimal clustering, so Ckmedian is preferable. You should only specify this argument if rpy2 and/or Ckmeans.1d.dp are not installed on your machine or you can't get them to run.
  2. -num-bins <K> - this specifies the number of bins to partition SNPs into. By default PolyLoc will try to estimate this number. You should specify this number if either (a) you specified --skip-Ckmedian (because scikits-learn cannot estimate the number of bins) or (b) the estimation is too slow.



PolyLoc stage 2: Compute LD-scores for the SNP bins

This stage is similar to the LD-score computation stage in PolyFun. In this stage PolyLoc will compute LD-scores for each SNP bin. This is the most computationally intensive part of PolyLoc. PolyLoc can compute LD-scores for all chromosomes in the same run, or for only one chromosome at a time. The second option is recommended if you can run multiple jobs on a cluster. These LD-scores can be based on either genotyping data from a reference panel (encoded in plink binary files) or it can be based on our pre-computed UK Biobank LD matrices.

Here is an example that computes LD-scores for only chromosome 1, using genotypes from the 1000-genomes reference panel:

python polyloc.py \
    --output-prefix output/polyloc_test \
    --compute-ldscores \
    --bfile-chr example_data/reference. \
    --chr 1

Here is an example that computes LD-scores for only chromosome 1 by downloading our precomputed UK Biobank LD matrices:

mkdir -p LD_cache
python polyloc.py \
    --output-prefix output/polyloc_test \
    --compute-ldscores \
    --ld-ukb \
    --ld-dir LD_cache \
    --bfile-chr example_data/reference. \
    --chr 1
Please note the following:
  1. PolyLoc accepts the same parameters as PolyFun for LD-scores computations.
  2. You must specify the same --output-prefix argument that you provided in stage 1, because PolyLoc requires intermediate files that were created in stage 1.
  3. If you remove the flag --chr, PolyLoc will iterate over all chromosomes and compute LD-scores for all of them, which may take a long time.
  4. The genotype or LD data should be population-matched to your study. Ideally this data should come from your study directly. In the first example we used a small subset of SNPs of European-ancestry individuals from the 1000 genomes project. In the second example we used precomputed summary LD information from the UK Biobank.
  5. If you compute LD from individual-level data, there are various parameters that you can use to control the LD-score computations, analogue to the respective parameters in the ldsc package. Please type python polyloc.py --help to see all available parameters. The parameter --keep <keep file> can be especially useful if you have a very large reference panel and would like to speed-up the computations by using only a subset of individuals.
  6. You must provide a --bfile-chr flag even if you use precomputed summary LD information. In this case PolyLoc only requires a plink .bim file, so you can create such a file even if you don't have access to individual level genotypes.
  7. You can run stages 1 and 2 together by invoking polyloc.py with both the flags --compute-partitions and --compute-ldscores.



PolyLoc stage 3: Estimate the heritability casaully explained by each bin

This stage requires LD-score weights files and a summary statistics file that is different from the one used to compute posterior causal effect sizes (to prevent biased estimates due to winner's curse). Here is an example command:

python polyloc.py \
    --output-prefix output/polyloc_test \
    --compute-polyloc \
    --w-ld-chr example_data/weights. \
    --sumstats example_data/sumstats2.parquet

This commands creates two output files: output/polyloc_test.bin_h2 and output/polyloc_test.Mp. The first output file shows estimates of the heritability causally explained by the SNPs in each bin. The output for our example (seen by typing cat output/polyloc_test.bin_h2) is:

BIN  BIN_SIZE  SUM_BINSIZE  %H2      %H2_STDERR  SUM_%H2  SUM_%H2_STDERR
1    10        10           0.17250  0.01015     0.17250  0.01015
2    32        42           0.13510  0.01878     0.30760  0.01970
3    70        112          0.12648  0.01779     0.43408  0.02313
4    110       222          0.13666  0.01394     0.57074  0.02198
5    193       415          0.15129  0.01394     0.72203  0.02134
6    255       670          0.09629  0.01070     0.81832  0.01907
7    315       985          0.07411  0.00857     0.89243  0.01611
8    410       1395         0.05337  0.00840     0.94580  0.01297
9    492       1887         0.05057  0.00877     0.99637  0.00807
10   639       2526         0.00363  0.00807     1.00000  0.00000
11   949       3475         0.00000  0.00000     1.00000  0.00000

The output shows that PolyLoc partitioned SNPs into 11 bins of similar posterior per-SNP heritability. The first bin includes 10 SNPs that jointly explain 17.25% of the total SNP heritability, the second bin includes 32 SNPs that jointly explain 13.5% of the total SNP heritability, and so on. The identities of the SNPs in each bin are the SNPs in the posterior effect sizes file, ranked according to their posterior per-SNP heritability estimates (i.e., the sum of their squared posterior mean and their squared posterior standard deviation). That is, the SNPs in bin 1 are the 10 SNPs with the largest posterior per-SNP heritability, the SNPs in bin 2 are the next top ranked 32 SNPs, and so on.


The second output file shows estimates of the the minimal number of SNPs that causally explain different proportions p of the SNP heritability (denoted Mp). Here are the first 50 lines of the output for our example (seen by typing head -51 output/polyloc_test.Mp):

p       Mp      Mp_STDERR
1       1       0.00000
2       2       0.00000
3       2       0.00000
4       3       0.00000
5       3       0.00000
6       4       0.00000
7       5       1.40360
8       5       0.00000
9       6       0.00000
10      6       0.00000
11      7       0.00000
12      7       2.76434
13      8       0.00000
14      9       1.40360
15      9       0.00000
16      10      0.99500
17      10      0.99500
18      12      3.12584
19      15      4.54874
20      17      1.72627
21      19      4.76246
22      22      3.54006
23      24      2.97164
24      26      7.10397
25      29      3.14172
26      31      4.25020
27      34      6.50135
28      36      4.22261
29      38      5.38279
30      41      6.23894
31      44      10.83559
32      49      11.50175
33      55      11.07154
34      60      11.79390
35      66      10.80248
36      72      11.79495
37      77      10.92612
38      83      12.09894
39      88      11.54147
40      94      12.26475
41      99      11.86560
42      105     12.71486
43      110     13.26258
44      117     18.60806
45      125     18.93879
46      133     18.51573
47      141     18.18059
48      149     17.72502
49      158     17.44208
50      166     17.29313

Each line shows the number of top-ranked SNPs (with respect to posterior per-SNP heritability estimated by PolyFun) that causally explain proportion p% of SNP heritability. For example, the last line shows that PolyLoc estimates that 166 top-ranked SNPs are required to causally explain 50% of SNP heritability, with a standard error of 17.29.

General notes about PolyLoc

  1. The summary statistics that you provide to stage 3 of PolyLoc must be based on a different dataset than the dataset that was to run PolyFun to compute per-SNP heritabilities. Otherwise, your results are subject to winner's curse and could be extremely biased!
  2. As in PolyFun, you can run jointly multiple stages of PolyLoc by using several mode parameters (e.g. python polyloc.py ----compute-partitions --compute-ldscores --compute-polyloc).
  3. If you use run_finemapper to fine-map multiple loci, you can concatenate the output files of these loci (making sure there aren't duplicate SNPs that appear twice) and provide the concatenated file as an input posterior file.
  4. PolyLoc estimates per-bin heritability with respect to all SNPs in the plink files. If you would like to exclude to e.g. only common SNPs (as done in the PolyFun paper), please create plink files with only common SNPs and subset your posterior file accordingly (no need to subset the sumstats file).