Make personal protein databases using next generation sequencing data.
Installation on a Unix/Linux distribution (including OS X) is straightforward and enumerated below. To simplify this further (and offer a solution that should work on most platforms), we have included a Dockerfile that will set up a Docker container with all necessary dependencies installed.
With Docker
First, install the Docker Community Edition application for your system.
Next, clone the GitHub repo, which includes the Dockerfile for GenPro, and use it to build a Docker image with all dependencies (and GenPro) installed.
git clone https://github.com/wingolab-org/GenPro.git
cd GenPro
docker build -t genpro .
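To confirm the build succeeded, you can list the image (genpro is the tag assigned by -t above):

# Lists locally available images whose repository name is genpro
docker images genpro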
There are a few options for running GenPro using the Docker container. The simplest is to log in to the container using the command below. For this to be useful, we need to allow the Docker container to write to the host system. This is done with the -v option, as in -v /your/machine:/docker/container. Set the host directory accordingly; note that in the example below it is set to the working directory from which you launch the genpro Docker container.
docker run -it -v $(pwd):/GenProData genpro bash
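You can also run a GenPro command non-interactively and let its output land in the mounted host directory. A minimal sketch, assuming the image places the GenPro scripts on the PATH (the flags mirror the download example later in this README):

# Runs a single command in a throwaway container; output is written under $(pwd)
docker run --rm -v $(pwd):/GenProData genpro \
bash -c 'cd /GenProData && GenPro_download_ucsc_data.pl -a -d hg38 -g hg38'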
To install GenPro directly, without Docker, the approach is:
- Install local::lib.
- Install App::cpanminus.
- Clone the GenPro repository.
- Install the prebuilt GenPro package in the repository.
Install local::lib by downloading the latest tarball and unpacking it. See the bootstrapping section of the local::lib documentation.
Example:
curl -O https://cpan.metacpan.org/authors/id/H/HA/HAARG/local-lib-2.000019.tar.gz
tar xzvf local-lib-2.000019.tar.gz
cd local-lib-2.000019
perl Makefile.PL --bootstrap
make test && make install
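local::lib installs modules under ~/perl5 by default, so your shell needs the environment variables that local::lib generates in order for Perl to find them. Per the local::lib documentation, this is done with:

# Load local::lib's environment; append this line to ~/.bashrc to make it permanent
eval "$(perl -I$HOME/perl5/lib/perl5 -Mlocal::lib)"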
Install App::cpanminus:
curl -L http://cpanmin.us | perl - App::cpanminus
Install GenPro with cpanm, which will automatically install any missing dependencies.
git clone https://github.com/wingolab/GenPro.git
cd GenPro
cpanm GenPro.tar.gz
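If you bootstrapped local::lib above, cpanm can be pointed at it explicitly with its standard --local-lib flag, and a quick which check (using script names from the steps below) confirms the installed scripts are on your PATH:

# Optional: install into the local::lib tree instead of the system Perl
cpanm --local-lib ~/perl5 GenPro.tar.gz
# Sanity check: the GenPro scripts should now resolve
which GenPro_download_ucsc_data.pl GenPro_create_db.pl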
An alternative approach is to clone the repository, unpack the GenPro.tar.gz tarball, and install it manually. This will likely require sudo, depending on how Perl is set up.
git clone https://github.com/wingolab/GenPro.git
cd GenPro
tar xzvf GenPro.tar.gz
cd GenPro
perl Makefile.PL
make test
make install
To fetch the genome and gene annotations from UCSC, use GenPro_download_ucsc_data.pl:
GenPro_download_ucsc_data.pl -a -d hg38 -g hg38
This will perform a dry-run download of hg38 (genome and annotated gene coordinates). It relies on rsync being installed, which should be present on Unix, Linux, and OS X by default. In the example, the data will be downloaded into the hg38 directory, which will be created if it does not already exist. GenPro_download_ucsc_data.pl will download the knownGene track and the genome of the organism by default. By default, GenPro_download_ucsc_data.pl is set to a dry run (i.e., no download). Use the --act switch to "act", i.e., download the data. Take care when using it, since it will download from a remote server.
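For example, once the dry-run output looks correct, the same command with the --act switch performs the actual download (directory and genome flags taken from the example above):

# This downloads from the remote UCSC server
GenPro_download_ucsc_data.pl --act -d hg38 -g hg38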
- Use GenPro_create_db.pl to build the indexed binary annotation.
- A helper script, sh/runall_create_db.sh, works with SGE to build all chromosomes on a cluster. For example:
qsub -v USER -v PATH -cwd -t 1-26 runall_create_db.sh <genome>
- Alternatively, iterate over all the chromosomes. For example:
for ((i=1;i<27;i++)); do
GenPro_create_db.pl -g hg38 -c $i \
--genedir hg38/gene \
--geneset knownGene \
-f hg38/chr -o hg38/idx;
done;
- GenPro_make_refprotdb.pl uses the indexed binary annotation and creates all proteins for a given chromosome.
- A helper script, runnall_make_refprot.sh, automates this for SGE. For example:
qsub -v USER -v PATH -cwd -t 1-26 runnall_make_refprot.sh \
<genome> \
<binary genome index> \
<output dir>
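Without an SGE cluster, one possible workaround is to emulate the array job serially. This is an untested sketch that assumes the helper script selects its chromosome from the standard SGE_TASK_ID variable that qsub -t would otherwise provide:

# Emulate the SGE array job (assumption: the script reads SGE_TASK_ID)
for ((i=1;i<27;i++)); do
SGE_TASK_ID=$i sh sh/runnall_make_refprot.sh <genome> <binary genome index> <output dir>;
done;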
- This is a 2-step process using GenPro_make_perprotdb1.pl and GenPro_make_perprotdb2.pl.
- GenPro_make_perprotdb1.pl creates a per-chromosome database of all relevant variants (i.e., nonsense/missense) for each sample in the snp file.
- To convert vcf to snp you will need bcftools installed. Use the helper script, sh/vcfToSnp.sh, to perform the conversion (see the sketch after this list).
- GenPro_make_perprotdb2.pl creates a finished personal protein database for each sample. It provides two outputs:
  - A JSON-encoded file that enumerates the variant protein information.
  - A FASTA file with all full-length reference and variant proteins, which may be used as input for a proteomics search program.
- Creating the final db is done on a per-sample basis, and it can take quite a while, depending on how many proteins carry multiple variants. It is especially slow for proteins with >10 variants. For example, a protein with 20 substitutions has 20! permutations. By design, all 20! proteins will be considered, and only the proteins that contribute unique peptides will be retained. Any protein with more than 20 sites will have all variants inserted into the reference protein without performing any permutation.
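The snp column layout is defined by sh/vcfToSnp.sh, and that script should be used for real runs. As a rough illustration of the kind of conversion involved, a bcftools one-liner that flattens a VCF into a tab-delimited table of sites plus per-sample genotypes (hypothetical output columns, not the exact GenPro snp format) looks like:

# Illustrative only; the real snp format is produced by sh/vcfToSnp.sh
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' \
samples.vcf.gz > samples.snp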