The program FALCON-formatter takes fastq and fasta files from a Pacific Biosciences sequencer and formats them for de novo assembly with FALCON.
Even though it is more convenient to store all reads in a single FASTA or FASTQ file on your system, Dazzler (and therefore FALCON) does not accept this kind of input. All inputs MUST be in FASTA format with files split by barcode, set, and part number. This means that fields 1-6 in the example below must be unique to each input file.
m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230
1yymmdd_hhmmss 33333 4444444444444444444444444444444444 55 66 777 8888888888
1. “m” = movie
2. Time of Run Start (yymmdd_hhmmss)
3. Instrument Serial Number
4. SMRT Cell Barcode
5. Set Number
6. Part Number
7. ZMW hole number*
8. Subread Region (start_stop using polymerase read coordinates)*

*Fields 7 and 8 are only used in FASTA/Q headers.
More information about file formats can be found at the SMRT-Analysis wiki.
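As an illustration of the header layout described above, here is a minimal Python sketch that splits a header into its eight fields with a regular expression. The pattern and the group names (`movie`, `run_start`, and so on) are assumptions made for this example; they are not taken from FALCON-formatter or any PacBio library.

```python
import re

# Regular expression for the eight header fields described above.
# The group names are illustrative labels, not part of any PacBio or FALCON API.
HEADER_RE = re.compile(
    r"^>?(?P<movie>m"                    # field 1: literal "m" (movie)
    r"(?P<run_start>\d{6}_\d{6})_"       # field 2: yymmdd_hhmmss
    r"(?P<instrument>[^_/]+)_"           # field 3: instrument serial number
    r"(?P<barcode>[^_/]+)_"              # field 4: SMRT Cell barcode
    r"(?P<set_number>[^_/]+)_"           # field 5: set number
    r"(?P<part_number>[^_/]+))"          # field 6: part number
    r"/(?P<zmw>\d+)"                     # field 7: ZMW hole number
    r"/(?P<start>\d+)_(?P<stop>\d+)$"    # field 8: subread region
)


def parse_header(header):
    """Return a dict of header fields, or raise ValueError if it does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("unrecognized header: %r" % header)
    return match.groupdict()


fields = parse_header(
    ">m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230"
)
# "movie" spans fields 1-6, the part that must be unique to each input file.
print(fields["movie"])
print(fields["zmw"], fields["start"], fields["stop"])
```

The `movie` group covers fields 1 through 6, so it doubles as the per-file key discussed in this section.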
The example below demonstrates this requirement by correctly splitting the file Example.fasta.
Example.fasta
>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/103_725
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/973_13390
>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/15030_17394
The four headers contain two unique combinations of fields 1-6:
>m140415_143853_42175_c100635972550000001823121909121417_s1_p0
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0
All subreads corresponding to these headers need to be in their own files, so Example.fasta would be split accordingly:
m140415_143853_42175_c100635972550000001823121909121417_s1_p0.fasta
>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230
>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/15030_17394
m140415_143853_42175_c324508543089230982134098587348034_s1_p0.fasta
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/103_725
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/973_13390
FALCON-formatter takes FASTA/Q files or folders of files as input, converts FASTQ to FASTA, and writes each read to a file corresponding to fields 1 through 6.
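The splitting step can be sketched in a few lines of Python. This is not FALCON-formatter's actual implementation, only a minimal illustration of grouping records by the part of the header before the first `/`. The function names (`movie_prefix`, `split_fasta`, `write_groups`) and the short sequences are made up for the example.

```python
import os
import tempfile
from collections import defaultdict


def movie_prefix(header):
    """Return fields 1-6: the header text before the first '/', without '>'."""
    return header.lstrip(">").split("/", 1)[0]


def split_fasta(lines):
    """Group FASTA lines by movie prefix and return {prefix: [lines]}."""
    groups = defaultdict(list)
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = movie_prefix(line)
        if current is None:
            raise ValueError("sequence data found before the first header")
        groups[current].append(line)
    return dict(groups)


def write_groups(groups, out_dir):
    """Write one <prefix>.fasta file per group and return the file paths."""
    paths = []
    for prefix, records in sorted(groups.items()):
        path = os.path.join(out_dir, prefix + ".fasta")
        with open(path, "w") as handle:
            handle.write("\n".join(records) + "\n")
        paths.append(path)
    return paths


# Headers from Example.fasta; the sequences are hypothetical placeholders.
example = [
    ">m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230",
    "ACGTACGT",
    ">m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/103_725",
    "TTGACCA",
    ">m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/973_13390",
    "GGCATTA",
    ">m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/15030_17394",
    "CCGTAAG",
]

groups = split_fasta(example)
with tempfile.TemporaryDirectory() as out_dir:
    write_groups(groups, out_dir)
    names = sorted(os.listdir(out_dir))
print(names)
```

Holding every record in memory keeps the sketch short; for real sequencing output, appending to the output files as records are read would be the more practical choice.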
Using setuptools
git clone https://github.com/zyndagj/FALCON-formatter
cd FALCON-formatter
python setup.py install --user
Using pip
pip install --user git+https://github.com/zyndagj/FALCON-formatter
The program FALCON-formatter (installed in $HOME/.local/bin) takes fastq and fasta files from a Pacific Biosciences sequencer and formats them for de novo assembly with FALCON.
FALCON-formatter [-h] [-w INT] [-o STR] F [F ...]
Argument | Description |
---|---|
F | Fastq/a files or folders to format |
Flag | Option | Description |
---|---|---|
-h | | show this help message and exit |
-w | INT | hard-wrap fasta output at [80] base-pairs |
-o | DIR | output path [.] |
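To make the `-w` option concrete, the sketch below shows what hard-wrapping at 80 base pairs means when a FASTQ record is converted to FASTA. It assumes four-line FASTQ records and is not the tool's own code. The helper names (`wrap_sequence`, `fastq_to_fasta`) and the read itself are hypothetical.

```python
def wrap_sequence(sequence, width=80):
    """Split a sequence into consecutive chunks of at most `width` characters."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def fastq_to_fasta(fastq_lines, width=80):
    """Convert four-line FASTQ records into hard-wrapped FASTA lines."""
    lines = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta = []
    for i in range(0, len(lines), 4):
        header, sequence, plus = lines[i], lines[i + 1], lines[i + 2]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        fasta.append(">" + header[1:])  # '@name' becomes '>name'
        fasta.extend(wrap_sequence(sequence, width))  # quality line is dropped
    return fasta


# A hypothetical 200 bp read; the header only mimics the naming scheme.
record = [
    "@m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/0_200",
    "ACGT" * 50,
    "+",
    "I" * 200,
]
fasta = fastq_to_fasta(record)
print([len(line) for line in fasta[1:]])  # prints [80, 80, 40]
```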
$ FALCON-formatter ecoli.fasta
Processing: ecoli.fasta
If you’re coming from CyVerse, first find the FALCON-formatter app in the HPC app catalog and launch it. Then, click the “Inputs” drop-down arrow to designate your inputs.
Then, click the browse button to open up a file explorer to choose your input.
Select either a single fasta/fastq file or a whole folder to process.
Click “Launch Analysis” to start your job. You’ll get notifications when the program starts and when it finishes.