01 Developer Setup
- Conda (miniconda preferred)
-
Create a new conda environment named guidescan with Python version 3.10 (and pip installed in the environment to avoid any surprises later).
conda create --name guidescan python=3.10 pip
-
Activate the environment.
conda activate guidescan
The command prompt will change to indicate the new conda environment by prepending (guidescan).
-
Clone the repository and enter it:
git clone https://github.com/pritykinlab/guidescanpy.git
cd guidescanpy
-
Install the package in editable mode and any optional dependencies:
pip install -e ".[web,dev]"
-
Install guidescan
The core guidescan program is needed for indexing genomes and creating new databases. It is sufficient that the binary be accessible in the activated conda environment. Unless you want to download and compile guidescan (the command line utility for indexing and database generation) yourself, the easiest option is to install it from bioconda. The command that is likely to work for most platforms is:
conda install -c conda-forge -c bioconda guidescan
Verify the guidescan version by running guidescan --version on the command line. Read the guidescan documentation on how to use the utility.
-
Run tests
This step is crucial to see if guidescanpy and guidescan are working correctly. Run:
cd docker/snakemake
snakemake -F guidescan_pytest --cores 1 --use-conda --config max_kmers=1000 enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
This will run a workflow that generates a small amount of test data (1000 kmers) for the sacCer3 organism and the cas9 enzyme, and then runs the unit tests found in the tests folder.
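Before launching the workflow above, it can help to confirm that the command-line tools it calls are discoverable in the activated environment. The following is a minimal sketch and not part of guidescanpy; the helper name missing_tools and the tool list are assumptions drawn from the commands used in this guide.

```python
import shutil

# Executables invoked by the setup steps in this guide (an assumed list,
# taken from the commands shown above, not from guidescanpy itself).
REQUIRED_TOOLS = ("guidescan", "snakemake")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it with the guidescan environment activated; a tool reported as missing usually means the corresponding install step was skipped or a different environment is active.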
To start working on guidescanpy, we will likely need some real data.
"Data" in guidescanpy comprises:
- A relational database to store chromosome and gene information for organisms. By default this is a local sqlite database (guidescan.db).
- BAM files that store on-target and off-target information for an organism + enzyme combination. For example, the sacCer3 organism + the cas9 enzyme combination will make up a single .bam file.
- Index files that allow guidescan to quickly search an organism's genomic sequence. For example, the sacCer3 organism's sequence will have a single index (each index is made up of 3 files, as we'll see shortly).
To generate sample data for sacCer3/cas9, repeat the step we ran in (6) above, but with minor variations:
cd docker/snakemake
snakemake -F --cores 1 --use-conda --config enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
This step will use all the CPU cores on your machine, and will likely take an hour or more. For other organisms, including hg38, it will take substantially more time. If you're impatient, you can download pre-generated BAM and index files from our website. See this link to see how.
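The escaped quotes in the --config values above exist only to get a JSON list such as ["cas9"] past the shell. As an illustration, the sketch below builds the same argument list in Python, where no shell escaping is needed. The function name snakemake_command and its parameters are hypothetical; only the flags and config keys already shown in this guide are used.

```python
import json
import shlex


def snakemake_command(organisms, enzymes, max_kmers=None, target=None, cores=1):
    """Build the snakemake argument list used in this guide (illustrative)."""
    cmd = ["snakemake", "-F"]
    if target is not None:
        cmd.append(target)  # e.g. "guidescan_pytest" for the test workflow
    cmd += ["--cores", str(cores), "--use-conda", "--config"]
    if max_kmers is not None:
        cmd.append(f"max_kmers={max_kmers}")
    # Each config value is a JSON list, e.g. enzymes=["cas9"]
    cmd.append(f"enzymes={json.dumps(list(enzymes))}")
    cmd.append(f"organisms={json.dumps(list(organisms))}")
    return cmd


if __name__ == "__main__":
    cmd = snakemake_command(["sacCer3"], ["cas9"])
    # shlex.join quotes each argument so the line can be pasted into a shell
    print(shlex.join(cmd))
```

To actually run the workflow from Python, the list could be passed to subprocess.run(cmd, cwd="docker/snakemake", check=True), assuming the current directory is the repository root.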
However you choose to generate the data, you will need to set two environment variables, which tell guidescanpy the location of the BAM files and index files. These are GUIDESCAN_BAM_PATH and GUIDESCAN_INDEX_PATH respectively.
In the following example, we have downloaded the sacCer3+cas9 BAM file into databases/cas9, and the sacCer3 index files into indices.
$ pwd
/home/joe/guidescan/data
$ tree
.
├── databases
│   └── cas9
│       └── sacCer3.bam.sorted
└── indices
    ├── sacCer3.index.forward
    ├── sacCer3.index.gs
    └── sacCer3.index.reverse
Note the folder structure: the BAM file is stored in a sub-folder <enzyme> (cas9 or cpf1) inside databases, and the index files are <organism>.index.<extension> inside indices. So we can set the 2 required environment variables as:
export GUIDESCAN_BAM_PATH=/home/joe/guidescan/data/databases
export GUIDESCAN_INDEX_PATH=/home/joe/guidescan/data/indices
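A misplaced file is an easy mistake to make with this layout, so a quick check can be useful. The sketch below is not part of guidescanpy: it assumes the file naming shown in the example tree above (<organism>.bam.sorted and the three index extensions), and the helper names expected_files and missing_files are hypothetical.

```python
import os
from pathlib import Path

# Index file extensions as they appear in the example tree in this guide.
INDEX_EXTENSIONS = ("forward", "gs", "reverse")


def expected_files(bam_root, index_root, organism, enzyme):
    """List the BAM and index paths expected for one organism + enzyme pair."""
    bam = Path(bam_root) / enzyme / f"{organism}.bam.sorted"
    indices = [
        Path(index_root) / f"{organism}.index.{ext}" for ext in INDEX_EXTENSIONS
    ]
    return [bam, *indices]


def missing_files(organism, enzyme, env=os.environ):
    """Return the expected files that do not exist on disk.

    Raises KeyError if either environment variable is unset.
    """
    bam_root = env["GUIDESCAN_BAM_PATH"]
    index_root = env["GUIDESCAN_INDEX_PATH"]
    paths = expected_files(bam_root, index_root, organism, enzyme)
    return [path for path in paths if not path.is_file()]


if __name__ == "__main__":
    for path in missing_files("sacCer3", "cas9"):
        print("Missing:", path)
```

An empty result for missing_files("sacCer3", "cas9") means all four files from the example are where this sketch expects them.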
The guidescan.com website is made up of two parts - a Flask component which is the main web application, and a Celery task management component which handles long-running requests on the website. To start both of these, open up two terminal windows, and run the following commands. Both terminals need to have access to the environment variables we set above, so you may want to set those environment variables in your user profile.
--- Terminal 1 ---
conda activate guidescan
guidescan worker
--- Terminal 2 ---
conda activate guidescan
guidescan web
Note the link in the terminal when you run guidescan web (typically http://127.0.0.1:5001). This is the link you will use to open up the browser.
If you see a "Not Found" error (404) in the browser, append a /py to the address bar.
Keep both terminals active while you're interacting with the web application.
If you generated/downloaded data only for sacCer3, you will obviously only be able to run queries for that organism.
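The 404 note above can also be checked from a script instead of the browser. This is a rough sketch using only the Python standard library: it tries the base address first and then the same address with /py appended. The function names are hypothetical, and the default address is the typical one mentioned above, which may differ on your machine.

```python
import urllib.error
import urllib.request


def candidate_urls(base_url):
    """Return the base address followed by the /py variant."""
    base = base_url.rstrip("/")
    return [base + "/", base + "/py"]


def first_working_url(base_url, opener=urllib.request.urlopen):
    """Return the first candidate URL answering with HTTP 200, else None."""
    for url in candidate_urls(base_url):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return url
        except urllib.error.HTTPError as err:
            # A 404 means "try the next candidate"; other errors are re-raised.
            if err.code != 404:
                raise
        except urllib.error.URLError:
            # The server is not reachable at all (e.g. `guidescan web` not running).
            return None
    return None


if __name__ == "__main__":
    print(first_working_url("http://127.0.0.1:5001"))
```

A result of None suggests that the web process is not running or is listening on a different port than the one assumed here.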
- Start a new branch
cd <path_to_guidescanpy>
git checkout -b <your_awesome_branch_name>
- Install the pre-commit hook. This will allow you to identify style/formatting/coding issues every time you commit your code. Pre-commit automatically formats the files in your repository according to certain standards, and/or warns you if certain best practices are not followed.
pre-commit install
-
Tweak/modify the code, make guidescanpy better! Send a PR towards the main branch.
Our CI will automatically run the pre-commit and pytest steps for PRs towards the protected branches, so running these steps on your local installation will prevent surprises for you later.
When you are done with development, deactivate the guidescan environment and return to (base) with the following command:
conda deactivate