02 Generating Data
For development and testing purposes, the `sacCer3` organism is sufficient, since it is the smallest database/index to download or generate. This document therefore uses only `sacCer3` as an example.
A Snakemake workflow is used to initialize the databases and generate the data required for this project. To run the workflow:
- Navigate to the `docker/snakemake` folder.
- To generate the full databases, use the following command:
snakemake --cores 1 --use-conda
Running this command generates `bam` files for all available organisms and enzymes listed in `config.json`. Keep in mind that this process is time- and resource-intensive because of the large amount of data involved (it can take days!).
- Alternatively, you can customize the databases using the `--config` flag. You can specify the desired `organisms`, `enzymes`, and `max_kmers` to generate partial databases:
  - `max_kmers` (int): Defines the number of kmers to generate. The default is `inf`.
  - `organisms` (list): Generates database(s) for the specified organism(s). The default is `["sacCer3", "hg38", "ce11", "dm6", "mm10", "mm39", "rn6", "t2t_chm13"]`.
  - `enzymes` (list): Generates database(s) for the specified enzyme(s). The default is `["cas9", "cpf1"]`.
For example, to generate the `sacCer3/cas9` databases with only the first 1000 kmers, use the following command:
snakemake --cores 1 --use-conda --config max_kmers=1000 organisms=[\"sacCer3\"] enzymes=[\"cas9\"]
After running the workflow with `--config organisms=[\"sacCer3\"]`, the output data folder will have the following structure. In this case, set `GUIDESCAN_BAM_PATH` to the absolute path of the `databases` folder, and `GUIDESCAN_INDEX_PATH` to the absolute path of the `indices` folder.
├── databases
│   ├── cas9
│   │   ├── sacCer3.bam
│   │   ├── sacCer3.bam.bai
│   │   ├── sacCer3.bam.sorted
│   │   ├── sacCer3.bam.sorted.bai
│   │   └── sacCer3.sam
│   └── cpf1
│       ├── sacCer3.bam
│       ├── sacCer3.bam.bai
│       ├── sacCer3.bam.sorted
│       ├── sacCer3.bam.sorted.bai
│       └── sacCer3.sam
├── indices
│   ├── sacCer3.index.forward
│   ├── sacCer3.index.gs
│   └── sacCer3.index.reverse
├── job_status
│   ├── add_sacCer3.txt
│   └── init_db.txt
├── kmers
│   ├── cas9
│   │   └── sacCer3.csv
│   └── cpf1
│       └── sacCer3.csv
└── raw
    ├── sacCer3_chr2acc
    ├── sacCer3.fna
    ├── sacCer3.fna.forward.dna
    ├── sacCer3.fna.reverse.dna
    └── sacCer3.gtf.gz
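One way to set the two environment variables from a shell is sketched below. `DATA_DIR` is a placeholder name chosen for this example, and its default of `$PWD/data` is an assumption; point it at wherever the workflow actually wrote its output. The loop only reports whether the sorted BAM file and the three index files from the layout are present, and it does not stop if they are missing.

```shell
# DATA_DIR is a placeholder (not defined by the project); it must be an absolute path.
DATA_DIR="${DATA_DIR:-$PWD/data}"

export GUIDESCAN_BAM_PATH="$DATA_DIR/databases"
export GUIDESCAN_INDEX_PATH="$DATA_DIR/indices"

# Report which of the expected sacCer3/cas9 files exist.
for f in \
  "$GUIDESCAN_BAM_PATH/cas9/sacCer3.bam.sorted" \
  "$GUIDESCAN_INDEX_PATH/sacCer3.index.forward" \
  "$GUIDESCAN_INDEX_PATH/sacCer3.index.gs" \
  "$GUIDESCAN_INDEX_PATH/sacCer3.index.reverse"
do
  if [ -e "$f" ]; then
    echo "found   $f"
  else
    echo "missing $f"
  fi
done
```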
As an alternative to generating data, you can use `guidescan download` to download data directly from our website.
- To see what is available to download:
guidescan download --type database --show item
guidescan download --type index --show item
- To download the BAM file for a particular organism/enzyme combination:
guidescan download --type database --item sacCer3_cas9
- To download index files for a particular organism:
guidescan download --type index --item sacCer3
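To fetch everything for one organism, the commands above can be generated in a loop. The sketch below only prints the commands and does not run them. It assumes that database item names follow the `<organism>_<enzyme>` pattern seen in `sacCer3_cas9` and that both `cas9` and `cpf1` items exist for the organism; check the `--show item` listing before relying on either. `download_cmds` is a helper name invented for this example.

```shell
# download_cmds ORGANISM
# Print (do not run) the download commands for one organism.
download_cmds() {
  for enzyme in cas9 cpf1; do
    # Assumed item naming: <organism>_<enzyme>
    echo "guidescan download --type database --item ${1}_${enzyme}"
  done
  echo "guidescan download --type index --item ${1}"
}

download_cmds sacCer3
```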
After downloading/unzipping the required files, rename them and arrange them in the following folder structure:
.
├── databases
│   └── cas9
│       └── sacCer3.bam.sorted
└── indices
    ├── sacCer3.index.forward
    ├── sacCer3.index.gs
    └── sacCer3.index.reverse
In particular, BAM files should be named `<organism>.bam.sorted` and arranged by `<enzyme>` folder. Index files should be named `<organism>.index.<extension>`. In the case shown above, set `GUIDESCAN_BAM_PATH` to the absolute path of the `databases` folder, and `GUIDESCAN_INDEX_PATH` to the absolute path of the `indices` folder.
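The renaming rules can be captured in two small shell helpers, sketched here under assumptions. `arrange_bam` and `arrange_index` are hypothetical names, and the names of the downloaded files are not specified on this page, so the source paths in the usage comments are placeholders. The helpers copy rather than move, so the downloaded files stay in place.

```shell
# arrange_bam SRC ORGANISM ENZYME DEST
# Copy a downloaded BAM file to DEST/databases/ENZYME/ORGANISM.bam.sorted.
arrange_bam() {
  mkdir -p "$4/databases/$3" &&
    cp "$1" "$4/databases/$3/$2.bam.sorted"
}

# arrange_index SRC ORGANISM EXTENSION DEST
# Copy a downloaded index file to DEST/indices/ORGANISM.index.EXTENSION.
arrange_index() {
  mkdir -p "$4/indices" &&
    cp "$1" "$4/indices/$2.index.$3"
}

# Usage (source file names are placeholders):
#   arrange_bam   path/to/downloaded.bam   sacCer3 cas9    "$PWD/data"
#   arrange_index path/to/downloaded.gs    sacCer3 gs      "$PWD/data"
```

With `DEST` set to `$PWD/data`, `GUIDESCAN_BAM_PATH` would then be `$PWD/data/databases` and `GUIDESCAN_INDEX_PATH` would be `$PWD/data/indices`.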