02 Generating Data

For development and testing, the sacCer3 organism is sufficient, since it has the smallest database and index to download or generate. This document therefore uses only `sacCer3` as an example.

A Snakemake workflow is used to initialize databases and generate required data for this project. These are the steps run in the workflow:

[Figure: Snakemake DAG]

Run Snakemake workflow

Navigate to the docker/snakemake folder.

To generate the full databases, use the following command:
```
 snakemake --cores 1 --use-conda
```
Running this command generates BAM files for every organism and enzyme listed in config.json. The process is time- and resource-intensive because of the large amount of data involved (it can take days!).
Alternatively, you can generate partial databases by passing the desired organisms, enzymes, and max_kmers through the --config flag:
- max_kmers (int): Defines the number of kmers to generate. The default is inf.
- organisms (list): Generates database(s) for the specified organism(s). The default is ["sacCer3", "hg38", "ce11", "dm6", "mm10", "mm39", "rn6", "t2t_chm13"].
- enzymes (list): Generates database(s) for the specified enzyme(s). The default is ["cas9", "cpf1"].
For example, to generate the sacCer3/cas9 databases with only the first 1000 kmers, use the following command:
```
 snakemake --cores 1 --use-conda --config max_kmers=1000 organisms=[\"sacCer3\"] enzymes=[\"cas9\"]
```
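The list values in the command above need care with shell quoting: the escaped quotes (`\"`) ensure Snakemake receives `organisms=["sacCer3"]` literally. As a minimal sketch, the same values can be built from plain names and wrapped in single quotes instead. The `to_list` helper is hypothetical (it is not part of this project), and the sketch assumes Snakemake accepts the same list form as in the example above. It only prints the command so that it can be reviewed before running.

```shell
# Hypothetical helper (not part of the project): turn names into a
# JSON-style list, e.g. "sacCer3 ce11" -> ["sacCer3","ce11"].
to_list() {
    out=""
    for name in "$@"; do
        # Prepend a comma only when the list already has an entry.
        out="${out:+$out,}\"$name\""
    done
    printf '[%s]' "$out"
}

organisms=$(to_list sacCer3)
enzymes=$(to_list cas9)

# Print the command for review. The single quotes keep each key=value pair
# as one word and stop the shell from treating [ ] as a glob pattern.
echo "snakemake --cores 1 --use-conda --config max_kmers=1000 'organisms=$organisms' 'enzymes=$enzymes'"
```

Copy the printed line into a terminal opened in the docker/snakemake folder to run it.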

Folder Structure

After running the workflow with --config organisms=[\"sacCer3\"], the output data folder will have the following structure. In this case, set GUIDESCAN_BAM_PATH to the absolute path to the databases folder, and GUIDESCAN_INDEX_PATH to the absolute path to the indices folder.

├── databases
│   ├── cas9
│   │   ├── sacCer3.bam
│   │   ├── sacCer3.bam.bai	
│   │   ├── sacCer3.bam.sorted
│   │   ├── sacCer3.bam.sorted.bai
│   │   └── sacCer3.sam
│   └── cpf1
│       ├── sacCer3.bam
│       ├── sacCer3.bam.bai	
│       ├── sacCer3.bam.sorted
│       ├── sacCer3.bam.sorted.bai
│       └── sacCer3.sam
├── indices
│   ├── sacCer3.index.forward
│   ├── sacCer3.index.gs
│   └── sacCer3.index.reverse
├── job_status
│   ├── add_sacCer3.txt
│   └── init_db.txt
├── kmers
│   ├── cas9
│   │   └── sacCer3.csv
│   └── cpf1
│       └── sacCer3.csv
└── raw
    ├── sacCer3_chr2acc
    ├── sacCer3.fna
    ├── sacCer3.fna.forward.dna
    ├── sacCer3.fna.reverse.dna
    └── sacCer3.gtf.gz
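For the layout above, the two environment variables could be set as in the following sketch. `DATA_DIR` is a placeholder introduced here for the absolute path of the output data folder; the default of `$PWD/data` is an assumption, not a location defined by the workflow, so adjust it to your setup.

```shell
# DATA_DIR is a placeholder for the absolute path of the output data folder.
# The fallback ($PWD/data) is an assumption; change it to the real location.
DATA_DIR="${DATA_DIR:-$PWD/data}"

export GUIDESCAN_BAM_PATH="$DATA_DIR/databases"
export GUIDESCAN_INDEX_PATH="$DATA_DIR/indices"

# Warn early if either folder is missing, rather than failing later.
for dir in "$GUIDESCAN_BAM_PATH" "$GUIDESCAN_INDEX_PATH"; do
    [ -d "$dir" ] || echo "warning: $dir does not exist" >&2
done
```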

Downloading pre-generated data

As an alternative to generating data, you can use guidescan download to download data directly from our website.

To see what is available to download:

guidescan download --type database --show item
guidescan download --type index --show item

To download the BAM file for a particular organism/enzyme combination:

guidescan download --type database --item sacCer3_cas9

To download index files for a particular organism:

guidescan download --type index --item sacCer3
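When several enzymes are needed for one organism, the download commands can be generated in a loop. The sketch below assumes that database items are always named `<organism>_<enzyme>`, which is inferred from the single `sacCer3_cas9` example above; confirm the actual names with `--show item` first. The `db_item` helper is hypothetical, and the loop only prints the commands rather than running them.

```shell
# Assumption: database items are named <organism>_<enzyme>, as in sacCer3_cas9.
# db_item is a hypothetical helper, not part of the guidescan CLI.
db_item() {
    printf '%s_%s' "$1" "$2"
}

organism=sacCer3
for enzyme in cas9 cpf1; do
    # Print one database download command per enzyme for review.
    echo "guidescan download --type database --item $(db_item "$organism" "$enzyme")"
done

# Index files are per organism, so a single command is enough.
echo "guidescan download --type index --item $organism"
```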

Folder Structure

After downloading/unzipping the required files, you will want to rename files and arrange them in the following folder structure:

.
├── databases
│   └── cas9
│       └── sacCer3.bam.sorted
└── indices
    ├── sacCer3.index.forward
    ├── sacCer3.index.gs
    └── sacCer3.index.reverse

In particular, BAM files should be named <organism>.bam.sorted and arranged by <enzyme> folder. Index files should be named <organism>.index.<extension>. In the case shown above, set GUIDESCAN_BAM_PATH to the absolute path to the databases folder, and GUIDESCAN_INDEX_PATH to the absolute path to the indices folder.
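Once the files are renamed and arranged, a small check can confirm that the layout matches the naming rules above. This is only a sketch: `check_layout` is a hypothetical helper, not a guidescan command, and it tests for the four files shown in the tree for a single organism/enzyme pair.

```shell
# Hypothetical helper: check_layout <root> <organism> <enzyme>
# Returns 0 if the expected BAM and index files exist under <root>,
# otherwise lists the missing files on stderr and returns 1.
check_layout() {
    root=$1 organism=$2 enzyme=$3
    missing=0
    for f in \
        "$root/databases/$enzyme/$organism.bam.sorted" \
        "$root/indices/$organism.index.forward" \
        "$root/indices/$organism.index.gs" \
        "$root/indices/$organism.index.reverse"
    do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_layout "$PWD" sacCer3 cas9` run from the folder that contains databases and indices reports any file that still needs to be renamed or moved.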
