Omer Weissbrod edited this page Mar 30, 2021 · 2 revisions

Q: Can PolyFun estimate which annotations are most relevant for my phenotype?
A: It is better to use stratified LD-score regression (S-LDSC) for this purpose, by estimating functional enrichment. The reason is that annotations tend to be correlated, so the PolyFun annotation coefficients can be very misleading. S-LDSC is included as part of the PolyFun code base. Please see the Wiki page on estimating functional enrichment using S-LDSC for details.


Q: Should I create a base annotation that includes only the number 1 for all SNPs?
A: Typically yes. However, in some cases the LD-scores for this annotation may be linearly dependent on the LD-scores of your other annotations, in which case you don't need to create it. This can happen if (1) the vector of ones [1.0 1.0 ... 1.0] is linearly dependent on your other annotations (which holds for the baseline-LF annotations); and (2) the LD-score that you compute for each SNP is based on (almost) exactly the same set of SNPs as your set of summary statistics SNPs. Hence, we did not include a base annotation in our version of the baseline-LF annotations.
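To illustrate point (1), here is a minimal sketch (using a made-up annotation matrix, not the real baseline-LF files) that checks whether the all-ones base annotation lies in the span of the other annotation columns:

```python
import numpy as np

#Hypothetical annotation matrix: 5 SNPs x 3 annotations.
#The first two annotations are complementary binary indicators, so their
#sum is the all-ones vector --- i.e. a base annotation would be redundant
annot = np.array([
    [1, 0, 0.5],
    [0, 1, 0.2],
    [1, 0, 0.9],
    [0, 1, 0.1],
    [1, 0, 0.3],
], dtype=float)
ones = np.ones(annot.shape[0])

#least-squares projection of the ones vector onto the annotation columns;
#a (near-)zero residual means the base annotation is linearly dependent
coef, _, _, _ = np.linalg.lstsq(annot, ones, rcond=None)
resid = np.linalg.norm(annot @ coef - ones)
base_is_redundant = resid < 1e-8
```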

Q: Can I add extra annotations on top of the baseline-LF annotations, without creating huge new files from scratch?
A: Yes. The flag --ref-ld-chr accepts a comma-separated list of file name prefixes, just like standard S-LDSC. For example, you can create a set of annotation files called my_annot.1.annot.parquet, ... my_annot.22.annot.parquet, and then invoke polyfun as follows:

python polyfun.py \
    --compute-h2-L2 \
    --output-prefix output/testrun \
    --sumstats example_data/sumstats.parquet \
    --ref-ld-chr example_data/annotations.,my_annot. \
    --w-ld-chr example_data/weights.

Q: Why am I getting all kinds of strange error messages?
A: Before reporting an error, please make sure that you have updated versions of all the required packages. In particular, you should have pandas version >=0.25.0, which includes many features not found in previous versions. You can check your version of pandas by typing the following in a python shell:

import pandas as pd
pd.__version__

Q: Can I use the UKB LD matrices for fine-mapping my non-UKB data?
A: Generally no, unless you have strictly British-ancestry individuals (see e.g. Benner et al. 2017 AJHG, Ulirsch et al. 2019 Nat Genet).


Q: I'm getting error messages saying that not all of my SNPs have annotation info.
A: The best solution is to create annotations for all your SNPs. Otherwise you might miss truly causal SNPs simply because they lack annotation info. However, if you're willing to take this risk, you can omit such SNPs by providing the flag --allow-missing to all invocations of polyfun.py.



Q: Can I use the polyfun code to run regular S-LDSC?
A: Yes, you can invoke the script ldsc.py just like you would invoke it for regular S-LDSC. This may be more convenient for some people because it supports Python 3 and .parquet files. However, please note that we did not exhaustively test this version of S-LDSC.
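For example, here is a sketch of a standard S-LDSC partitioned-heritability invocation via the bundled ldsc.py (all file paths below are placeholders; substitute your own summary statistics, annotation, weight and frequency file prefixes):

```shell
python ldsc.py \
    --h2 sumstats.parquet \
    --ref-ld-chr annotations. \
    --w-ld-chr weights. \
    --overlap-annot \
    --frqfile-chr frq. \
    --out enrichment_results
```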



Q: I'm getting weird error messages related to rpy2, and/or I'm getting the error message "Skipping fine-mapping test because either rpy2 or SuSiE are not properly installed. Please try reinstalling rpy2 and/or SuSiE".
A: It is sometimes tricky to install rpy2 because it needs to be linked to a proper installation of R. To test this, please type python -m rpy2.situation. You should ideally see output showing that the version of R in your path is the same as the version used to build rpy2. If it's not, please try updating your PATH environment variable to point to the directory containing the correct R version.



Q: Why does the UKB LD matrices directory have some missing regions / files with suffix .npz2?
A: These are long-range LD regions. We noticed that fine-mapping produces unreliable results in these regions, with many false-positive findings. The missing regions include the MHC region in chromosome 6, which has an extremely complex LD architecture. We suggest omitting such long-range LD regions from fine-mapping, or at least taking the results with a large grain of salt.
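If you fine-map with your own LD data, you can screen out such regions yourself. A minimal sketch (the MHC boundaries below are approximate hg19 coordinates, and df_snps is a made-up example; extend the filter with any other long-range LD regions you wish to exclude):

```python
import pandas as pd

#approximate MHC boundaries (chr6, hg19); adjust for your genome build
MHC_CHR, MHC_START, MHC_END = 6, 25_500_000, 33_500_000

#toy SNP table: one SNP inside the MHC, two outside
df_snps = pd.DataFrame({'CHR': [6, 6, 2],
                        'BP': [30_000_000, 40_000_000, 30_000_000]})

#drop SNPs falling inside the MHC before fine-mapping
in_mhc = (df_snps['CHR'] == MHC_CHR) & df_snps['BP'].between(MHC_START, MHC_END)
df_snps_filtered = df_snps.loc[~in_mhc]
```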



Q: Can I use the UKB LD files for my own purposes?
A: Sure. You can load the LD files into memory with the following Python function. It will return two Pandas dataframes:

  1. An LD matrix
  2. A dataframe with the properties of each SNP

Both dataframes have the same index, so you can reference one from the other.

import pandas as pd
import scipy.sparse as sparse

def load_ld_npz(ld_prefix):
    '''
    Load a pair of UKB LD files sharing the prefix ld_prefix
    (e.g. "chr10_102000001_105000001") and return two dataframes:
    the LD matrix and the metadata of each SNP, both indexed by
    "CHR.BP.A1.A2".
    '''

    #load the SNPs metadata
    gz_file = '%s.gz'%(ld_prefix)
    df_ld_snps = pd.read_table(gz_file, sep=r'\s+')
    df_ld_snps.rename(columns={'rsid':'SNP', 'chromosome':'CHR', 'position':'BP', 'allele1':'A1', 'allele2':'A2'}, inplace=True, errors='ignore')
    for col in ['SNP', 'CHR', 'BP', 'A1', 'A2']:
        assert col in df_ld_snps.columns, 'missing column: %s'%(col)
    df_ld_snps.index = df_ld_snps['CHR'].astype(str) + '.' + df_ld_snps['BP'].astype(str) + '.' + df_ld_snps['A1'] + '.' + df_ld_snps['A2']

    #load the LD matrix (stored as a sparse triangular matrix;
    #adding its transpose recovers the full symmetric matrix)
    npz_file = '%s.npz'%(ld_prefix)
    try:
        R = sparse.load_npz(npz_file).toarray()
        R += R.T
    except ValueError:
        raise IOError('Corrupt file: %s'%(npz_file))

    #wrap the LD matrix in a labeled dataframe and return it
    df_R = pd.DataFrame(R, index=df_ld_snps.index, columns=df_ld_snps.index)
    return df_R, df_ld_snps
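For example, here is how the shared index lets you cross-reference the two dataframes (the SNPs and LD values below are made up for illustration, standing in for the dataframes that load_ld_npz returns):

```python
import numpy as np
import pandas as pd

#toy stand-ins for the two dataframes returned by load_ld_npz,
#sharing the same "CHR.BP.A1.A2" index
idx = ['1.1000.A.G', '1.2000.C.T', '1.3000.G.A']
df_ld_snps = pd.DataFrame({'SNP': ['rs1', 'rs2', 'rs3'],
                           'CHR': [1, 1, 1],
                           'BP': [1000, 2000, 3000]}, index=idx)
df_R = pd.DataFrame(np.array([[1.0, 0.3, 0.1],
                              [0.3, 1.0, 0.2],
                              [0.1, 0.2, 1.0]]), index=idx, columns=idx)

#select the SNPs within a window of interest from the metadata...
snps_in_window = df_ld_snps.index[df_ld_snps['BP'] < 2500]

#...and slice the matching LD submatrix using the same index labels
R_window = df_R.loc[snps_in_window, snps_in_window]
```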