Chloroplast assembly protocol

A set of scripts for the assembly of chloroplast genomes out of whole-genome sequencing reads

Authors Carlos P Cantalapiedra (1), Ruben Sancho (1,2), Bruno Contreras Moreira (1,3)

Estación Experimental de Aula Dei-CSIC, Zaragoza, Spain
Escuela Politécnica Superior de Huesca, U.Zaragoza, Spain
Fundación ARAID, Zaragoza, Spain

Citation

This protocol was designed as part of this paper:

Sancho R, Cantalapiedra CP, López-Alvarez D, Gordon SP, Vogel JP, Pilar Catalán P,Contreras-Moreira (2018) Comparative plastome genomics and phylogenomics of Brachypodium: flowering time signatures, introgression and recombination in recently diverged ecotypes. New Phytologist, doi:10.1111/nph.14926 | preprint | ENA sequences: LT222229-LT222230 & LT558582-LT558636

Software dependencies

This protocol has been tested on Linux x86_64 systems, although it should also work on Mac-OSX settings. It requires perl5, which should be installed on most Linux environments, and the following packages:

g++
make
libz-dev
libpcre3-dev
java , or default-jre in Ubuntu

In addition, it uses some third-party programs listed below, which are provided pre-compiled. In order to check whether they work on your machine, or to re-compile them otherwise, for instance on Mac-OSX, please run on the terminal:

perl ./install.pl

The script will tell which programs are set up (OK) and which require installing a compiler (gcc or g++) or java in order to run.

This is the full list of software, located in bin/, required for the protocol:

Name	shipped version	Source
DUK	March 03, 2011	http://duk.sourceforge.net
Trimmomatic	0.32	http://www.usadellab.org/cms/?page=trimmomatic
FastQC	0.10.1	http://www.bioinformatics.babraham.ac.uk/projects/fastqc
musket	1.0.6	http://musket.sourceforge.net
BWA	0.7.6a	http://bio-bwa.sourceforge.net
velvet	1.2.08	https://github.com/dzerbino/velvet/tree/master
SSPACE	2.0	http://www.baseclear.com/bioinformatics-tools
GapFiller	v1-11	http://www.baseclear.com/bioinformatics-tools
seqtk		https://github.com/lh3/seqtk
split_pairs	0.5	https://github.com/eead-csic-compbio/split_pairs

Optionally samtools is used in HOWTOcheck_assembly.txt.

Trouble-shooting: The following error message:

bin/split_pairs_v0.5/kseqread: symbol lookup error: bin/split_pairs_v0.5/kseqread: undefined symbol: _ZN7pcrecpp2RE4InitERKSsPKNS_10RE_OptionsE

can be sorted as follows:

cd bin/split_pairs_v0.5
make clean
make
cd ../..

Input reads

This pipeline has been tested with paired-end (PE) and mate-pairs (MP) read libraries, with the following FASTQ quality encodings: Illumina 1.5 (Phred+64) and Sanger/Illumina (Phred+33). The code assumes that PE libraries have FR orientation, meaning that read pairs are "--><--". MP libraries are expected to be RF but can be set to FR as well.

Reference sequences

poaceae.fna is a FASTA file containing chloroplast (cp) genomes of several representative Poaceae species which provide DNA k-mers that help fishing out cp reads out of whole-genome read sets. It is used by 0_get_cp_reads.pl. This file should be adapted to the particular phylogenetic group under study.
reference.fna is required by this protocol, and it should contain the cp genome of a related species.
reference4columbus.fna is required for reference-guided assembly. Contains the cp genome of a related species split in two non-overlapping contigs so that the first conting contains the fragment LSC-IRa and the second the fragment SSC-IRb. This allows assembly of both inverted repeat (IRa and IRb) regions, as discovered by McPherson et al. Check the FASTA headers in both reference files to see how they must look like.

Examples

Input reads are within "test" directory.

Reference-guided assembly

Fish cp reads from whole genome library, using provided test reads (see flowchart):

./0_get_cp_reads.pl test test_cp poaceae.fna

Clean and trim reads to remove poor quality segments; output includes mean insert sizes and orientations (see flowchart):

./1_cleanreads.pl -folder test_cp -ref reference.fna

######- using a single PE library

Create a config file "test_cp/assembly_pe" cp test_cp/cleanreads.txt test_cp/assembly_pe and edit it leaving only one row:

1 testPE cp-testPE.wind15_28.3crop70.mlen60.corr.12.fq.gz FR 221 1.5
Assemble cp genome from a single PE library (see flowchart-1 and flowchart-2):

./2_assemble_reads.pl test_cp assembly_pe -ref reference.fna -refcolumbus reference4columbus.fna

######- using both PE and MP libraries

Then, create a different config file to use both read libraries cp test_cp/cleanreads.txt test_cp/assembly_mp and edit it reordering rows so that testPE is number 1:

1 testPE cp-testPE.wind15_28.3crop70.mlen60.corr.12.fq.gz FR 221 Sanger

2 testMP cp-testMP.wind15_28.3crop70.mlen60.corr.12.fq.gz RF 4295 Sanger
Assemble cp genome combining PE + MP libraries (see flowchart-1 and flowchart-2):

./2_assemble_reads.pl test_cp assembly_mp -ref reference.fna -refcolumbus reference4columbus.fna

Example of de-novo assembly

Fish cp reads from WGS library

./0_get_cp_reads.pl test test_cp_noref poaceae.fna

clean and trim reads to remove poor quality segments

./1_cleanreads.pl -folder test_cp_noref

Create config file test_cp_noref/assembly_pe cp test_cp_noref/cleanreads.txt test_cp_noref/assembly_pe and edit test_cp_noref/assembly_pe file leaving one (PE reads) or two rows (PE + MP reads; see previous examples).

In this case we are leaving a single PE library, and note that we have to provide orientation and insert size (FR and 221 in this example):

1 testPE cp-testPE.wind15_28.3crop70.mlen60.corr.12.fq.gz FR 221 Sanger
Finally, assemble cp genome

./2_assemble_reads.pl test_cp_noref assembly_pe

For more info and parameters of the scripts see README.txt

In addition, examples yielding assembled data could be run following the steps at:

HOWTO_brachy.txt: assembly with Brachypodium Bd21 data (PE).
HOWTO_barley.txt: assembly with barley cultivar Morex data (PE + MP).

Post-assembly inspection

The tutorial HOWTOcheck_assembly.txt contains recipes to analyze and validate your assemblies.

Propagation of annotated features to assemblies

If a good quality master annotation in GenBank format is at hand, it can be used to propagate its features to several assembled chloroplast genomes with accompanying script _annot_fasta_from_gbk.pl. The way to invoke it is as follows, assuming strain XYZ was assembled:

perl _annot_fasta_from_gbk.pl reference.gbk assembly.fa XYZ_chloroplast.gbk XYZ

NOTE: This script requires setting a valid path to BLAST+ binaries installed on your system, plus Bioperl installed on your system.

Generating tracks from aligned assemblies

If several chloroplast assemblies are aligned they can then be further used to produce tracks for software CIRCOS. Script _check_matrix.pl does just that; it requires setting a reference strain and a window size, please check the code. An example call would be:

perl _check_matrix.pl chloroplast_alignment.fna

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chloroplast assembly protocol

Citation

Software dependencies

Input reads

Reference sequences

Examples

Reference-guided assembly

Example of de-novo assembly

For more info and parameters of the scripts see README.txt

Post-assembly inspection

Propagation of annotated features to assemblies

Generating tracks from aligned assemblies

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
bin		bin
pics		pics
test		test
0_get_cp_reads.pl		0_get_cp_reads.pl
1_cleanreads.pl		1_cleanreads.pl
2_assemble_reads.pl		2_assemble_reads.pl
HOWTO_barley.txt		HOWTO_barley.txt
HOWTO_brachy.txt		HOWTO_brachy.txt
HOWTOcheck_assembly.txt		HOWTOcheck_assembly.txt
LICENSE		LICENSE
README.md		README.md
README.txt		README.txt
_annot_fasta_from_gbk.pl		_annot_fasta_from_gbk.pl
_check_matrix.pl		_check_matrix.pl
install.pl		install.pl
log.clean		log.clean
log.cpreads		log.cpreads
poaceae.fna		poaceae.fna
reference.fna		reference.fna
reference4columbus.fna		reference4columbus.fna
reference_barley.fna		reference_barley.fna

License

eead-csic-compbio/chloroplast_assembly_protocol

Folders and files

Latest commit

History

Repository files navigation

Chloroplast assembly protocol

Citation

Software dependencies

Input reads

Reference sequences

Examples

Reference-guided assembly

Example of de-novo assembly

For more info and parameters of the scripts see README.txt

Post-assembly inspection

Propagation of annotated features to assemblies

Generating tracks from aligned assemblies

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages