01 Developer Setup

Vineet Bansal edited this page Sep 23, 2024 · 12 revisions

Prerequisites

  • Conda (miniconda preferred)

Installation

  1. Create a new conda environment named guidescan with Python version 3.10 (and pip installed in the environment to avoid any surprises later).

    conda create --name guidescan python=3.10 pip
    
  2. Activate the environment.

    conda activate guidescan
    

    The command prompt will change to indicate the new conda environment by prepending (guidescan).

  3. Clone the repository and enter it:

    git clone https://github.com/pritykinlab/guidescanpy.git
    cd guidescanpy
    
  4. Install the package in editable mode, together with its optional development dependencies:

    pip install -e ".[dev]"
    
  5. Install guidescan

    The core guidescan program is needed for indexing genomes and creating new databases. The binary only needs to be accessible in the activated conda environment. Unless you want to download and compile guidescan yourself, the easiest option is to install it from bioconda. The command that is likely to work on most platforms is:

    conda install -c conda-forge -c bioconda guidescan
    

    Verify the installation by running guidescan --version on the command line. Read the guidescan documentation to learn how to use the utility.

  6. Run tests

    This step is crucial to see if guidescanpy and guidescan are working correctly. Run:

    cd docker/snakemake
    snakemake -F guidescan_pytest --cores 1 --use-conda --config max_kmers=1000 enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
    

    This runs a workflow that generates a small amount of test data (1000 kmers) for the sacCer3 organism and the cas9 enzyme, and then runs the unit tests found in the tests folder.

    Mac users on Apple Silicon (M1/M2/M3 CPUs): One of the steps in the workflow adds cutting-efficiency values to the generated databases using Python 2.7 code supplied by a different research lab. Set the environment variable CONDA_SUBDIR to osx-64 so that conda can use Rosetta 2 emulation for these steps. In other words, the command to run is:

    CONDA_SUBDIR=osx-64 snakemake -F guidescan_pytest --cores 1 --use-conda --config max_kmers=1000 enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
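
Before running the workflow, it can help to confirm that the tools from the steps above are actually visible in the activated environment. The following is a minimal Python sketch and is not part of guidescanpy: the helper names `missing_tools` and `tool_version` are hypothetical, and the only commands it assumes are `guidescan` (with the `--version` flag mentioned in step 5) and `snakemake`.

```python
import shutil
import subprocess

# Commands used in the installation steps; this tuple is an assumption
# for illustration and can be extended as needed.
REQUIRED_TOOLS = ("guidescan", "snakemake")


def missing_tools(names=REQUIRED_TOOLS):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def tool_version(executable):
    """Run `<executable> --version` and return whatever it prints."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some programs print the version to stderr, so keep both streams.
    return (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("guidescan reports:", tool_version("guidescan"))
```

Run it from a shell in which the guidescan conda environment is active, since the check only sees what is on that shell's PATH.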
    

Generate sample data

To start working on guidescanpy, we will likely need some real data.

"Data" in guidescanpy consists of:

  • A relational database to store chromosome and gene information for organisms. By default this is a local sqlite database (guidescan.db).
  • BAM files that store on-target and off-target information for an organism + enzyme combination. For example, the sacCer3 organism + cas9 enzyme combination is stored in a single .bam file.
  • Index files that allow guidescan to quickly search an organism's genomic sequence. For example, the sacCer3 organism's sequence will have a single index (each index is made up of 3 files, as we'll see shortly).

To generate sample data for sacCer3/cas9, rerun the workflow from step 6 above, this time without the -F guidescan_pytest arguments and the max_kmers limit:

cd docker/snakemake
snakemake --cores 1 --use-conda --config enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"

Mac users on Apple Silicon (M1/M2/M3 CPUs): As in step 6, set the environment variable CONDA_SUBDIR to osx-64 so that conda can use Rosetta 2 emulation for the workflow step that runs Python 2.7 code. The command becomes:

CONDA_SUBDIR=osx-64 snakemake --cores 1 --use-conda --config enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
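
The escaped quotes in the --config values are easy to get wrong when the command is typed by hand. As a rough sketch (not part of guidescanpy), the same argument list can be assembled in Python, where `json.dumps` produces the list syntax without shell escaping. The function names `snakemake_command` and `snakemake_env` are hypothetical, and detecting Apple Silicon via `platform.machine() == "arm64"` is an assumption that only holds for a natively running Python interpreter.

```python
import json
import os
import platform
import shlex


def snakemake_command(enzymes, organisms, target=None, max_kmers=None, force=False):
    """Build the snakemake argument list used in this guide."""
    cmd = ["snakemake"]
    if force:
        cmd.append("-F")
    if target is not None:
        cmd.append(target)
    cmd += ["--cores", "1", "--use-conda", "--config"]
    if max_kmers is not None:
        cmd.append(f"max_kmers={max_kmers}")
    # json.dumps(["cas9"]) yields ["cas9"], the same text the shell passes
    # along for "[\"cas9\"]" once its own quoting has been removed.
    cmd.append(f"enzymes={json.dumps(enzymes)}")
    cmd.append(f"organisms={json.dumps(organisms)}")
    return cmd


def snakemake_env(base_env=None, system=None, machine=None):
    """Copy the environment, adding CONDA_SUBDIR=osx-64 on Apple Silicon."""
    env = dict(os.environ if base_env is None else base_env)
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if system == "Darwin" and machine == "arm64":
        env["CONDA_SUBDIR"] = "osx-64"
    return env


if __name__ == "__main__":
    # Print the sample-data command in a form that can be pasted into a shell.
    print(shlex.join(snakemake_command(["cas9"], ["sacCer3"])))
```

To run the workflow rather than print it, the list and environment could be passed to `subprocess.run(cmd, env=snakemake_env(), cwd="docker/snakemake", check=True)` from the repository root.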

This step will likely take a couple of hours. For other organisms, including hg38, it will take substantially longer. If you're impatient, you can instead download pre-generated BAM and index files from our website. See this link for instructions.

However you obtain the data, you will need to set two environment variables that tell guidescanpy where the BAM files and index files are located: GUIDESCAN_BAM_PATH and GUIDESCAN_INDEX_PATH, respectively.

In the following example, we have downloaded the sacCer3+cas9 BAM file into databases/cas9, and the sacCer3 index files into indices.

$ pwd
/home/joe/guidescan/data

$ tree
.
├── databases
│   └── cas9
│       └── sacCer3.bam.sorted
└── indices
    ├── sacCer3.index.forward
    ├── sacCer3.index.gs
    └── sacCer3.index.reverse

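Note the folder structure: the BAM file is stored in a sub-folder named after the enzyme (<enzyme>, i.e. cas9 or cpf1) inside databases, and the index files are named <organism>.index.<extension> inside indices. GUIDESCAN_BAM_PATH points at the databases folder itself, not at the enzyme sub-folder. So we can set the two required environment variables as: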

export GUIDESCAN_BAM_PATH=/home/joe/guidescan/data/databases
export GUIDESCAN_INDEX_PATH=/home/joe/guidescan/data/indices
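
A quick way to catch a misplaced file or a wrong path is to check the layout against the two environment variables. The sketch below is not part of guidescanpy; `expected_files` and `missing_files` are hypothetical helpers, and the file names (`<organism>.bam.sorted` and the three index extensions) are taken from the example listing above rather than from guidescanpy's source, so treat them as assumptions.

```python
import os
from pathlib import Path

# Index extensions as they appear in the example directory listing.
INDEX_EXTENSIONS = ("forward", "gs", "reverse")


def expected_files(bam_root, index_root, organism, enzyme):
    """List the files the example layout implies for one organism/enzyme pair."""
    bam_root, index_root = Path(bam_root), Path(index_root)
    files = [bam_root / enzyme / f"{organism}.bam.sorted"]
    files += [index_root / f"{organism}.index.{ext}" for ext in INDEX_EXTENSIONS]
    return files


def missing_files(organism, enzyme, env=None):
    """Return the expected files that do not exist on disk.

    Raises KeyError if either environment variable is unset.
    """
    env = os.environ if env is None else env
    bam_root = env["GUIDESCAN_BAM_PATH"]
    index_root = env["GUIDESCAN_INDEX_PATH"]
    candidates = expected_files(bam_root, index_root, organism, enzyme)
    return [path for path in candidates if not path.is_file()]


if __name__ == "__main__":
    try:
        missing = missing_files("sacCer3", "cas9")
    except KeyError as exc:
        print(f"Environment variable {exc.args[0]} is not set")
    else:
        for path in missing:
            print("missing:", path)
        if not missing:
            print("All expected sacCer3/cas9 files are present")
```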

Run the web application

The guidescan.com website is made up of two parts: a Flask component, which is the main web application, and a Celery task-management component, which handles long-running requests on the website. To start both, open two terminal windows and run the following commands. Both terminals need access to the environment variables we set above, so you may want to set them in your user profile.

--- Terminal 1 ---
conda activate guidescan
guidescan worker

--- Terminal 2 ---
conda activate guidescan
guidescan web

Note the link in the terminal when you run guidescan web (typically http://127.0.0.1:5001). This is the link you will use to open up the browser.

If you see a "Not Found" (404) error in the browser, append /py to the URL in the address bar.

Keep both terminals active while you're interacting with the web application.

If you generated/downloaded data only for sacCer3, you will obviously only be able to run queries for that organism.
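
As an alternative to two terminal windows, both processes can be started from a single script, which also guarantees that they share the same environment variables. This is only a sketch under stated assumptions and not part of guidescanpy: `run_together` is a hypothetical helper, and the only commands it relies on are `guidescan worker` and `guidescan web` from above.

```python
import subprocess

# The two commands from the section above.
COMMANDS = [["guidescan", "worker"], ["guidescan", "web"]]


def run_together(commands):
    """Start every command, wait for all of them, and return their exit codes.

    Child processes inherit the current environment, so variables such as
    GUIDESCAN_BAM_PATH and GUIDESCAN_INDEX_PATH are visible to each of them.
    """
    procs = []
    try:
        for cmd in commands:
            procs.append(subprocess.Popen(cmd))
        return [proc.wait() for proc in procs]
    finally:
        # On an error or Ctrl-C, stop whatever is still running.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()
        for proc in procs:
            proc.wait()
```

Calling `run_together(COMMANDS)` from a shell with the guidescan environment activated starts the worker and the web application together. Their output is interleaved in one terminal, which is the main trade-off compared with separate windows.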

Contributing to guidescanpy

  • Start a new branch:

    cd <path_to_guidescanpy>
    git checkout -b <your_awesome_branch_name>

  • Install the pre-commit hook. This will allow you to identify style/formatting/coding issues every time you commit your code. Pre-commit automatically formats the files in your repository according to certain standards, and/or warns you if certain best practices are not followed.

    pre-commit install

  • Tweak/modify the code, make guidescanpy better! Send a PR towards the main branch.

Our CI will automatically run the pre-commit and pytest steps for PRs towards the protected branches, so running these steps on your local installation will prevent surprises for you later.


When you are done with development, deactivate the guidescan environment and return to (base) with the following command:

conda deactivate