isoCirc: computational pipeline to identify high-confidence BSJs and full-length circRNA isoforms from isoCirc long-read data
isoCirc is a long-read sequencing strategy coupled with an integrated computational pipeline to characterize full-length circRNA isoforms using rolling circle amplification (RCA) followed by long-read sequencing.
- What is isoCirc?
- Installation
- Getting started
- Input and output
- Circular long-read alignment of isoCirc read
- FAQ
- Contact
- Changelog
isoCirc depends on two open-source software packages: bedtools (>= v2.27.0) and minimap2 (>= 2.11).
Please ensure that these packages are installed before running isoCirc.
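A quick way to confirm that both dependencies are reachable is to look them up on `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_tools` is hypothetical and not part of isoCirc, and the check does not verify the minimum versions listed above.

```python
import shutil


def find_missing_tools(tools=("bedtools", "minimap2")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("bedtools and minimap2 were found on PATH")
```

If either tool lives outside `PATH`, its location can instead be passed to isoCirc with the `--bedtools` or `--minimap2` option.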
isoCirc is written in Python; please use pip to install isoCirc:
pip install isocirc # first time installation
pip install isocirc --upgrade # update to the latest version
Alternatively, you can install isoCirc from source:
git clone https://github.com/Xinglab/isoCirc.git
cd isoCirc/isoCirc_pipeline && pip install .
cd isoCirc/test_data
isocirc -t 1 read_toy.fa chr16_toy.fa chr16_toy.gtf chr16_circRNA_toy.bed output
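The toy command above can also be assembled from Python, for example when running many samples from a script. This is only a sketch: `build_isocirc_command` is a hypothetical helper, not something isoCirc provides, and it covers just the `-t` option and the five positional arguments. Multiple circRNA annotation files are joined with `,`, as the isocirc help text specifies.

```python
import shlex


def build_isocirc_command(long_reads, ref_fa, anno_gtf, circ_annos, out_dir, threads=8):
    """Assemble an isocirc argument list (hypothetical helper, not part of isoCirc)."""
    return [
        "isocirc",
        "-t", str(threads),
        long_reads,
        ref_fa,
        anno_gtf,
        ",".join(circ_annos),  # several circRNA annotation files are comma-separated
        out_dir,
    ]


cmd = build_isocirc_command(
    "read_toy.fa", "chr16_toy.fa", "chr16_toy.gtf",
    ["chr16_circRNA_toy.bed"], "output", threads=1,
)
print(shlex.join(cmd))
# prints: isocirc -t 1 read_toy.fa chr16_toy.fa chr16_toy.gtf chr16_circRNA_toy.bed output

# To actually launch the run from the test_data directory:
#   import subprocess
#   subprocess.run(cmd, check=True)
```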
Detailed arguments:
usage: isocirc [-h] [-v] [-t THREADS] [--bedtools BEDTOOLS]
[--minimap2 MINIMAP2] [--short-read short.fa/fq] [--lordec LORDEC]
[--kmer KMER] [--solid SOLID] [--trf TRF] [--match MATCH]
[--mismatch MISMATCH] [--indel INDEL] [--match-frac MATCH_FRAC]
[--indel-frac INDEL_FRAC] [--min-score MIN_SCORE]
[--max-period MAX_PERIOD] [--min-len MIN_LEN]
[--min-copy MIN_COPY] [--min-frac MIN_FRAC]
[--high-max-ratio HIGH_MAX_RATIO]
[--high-min-ratio HIGH_MIN_RATIO]
[--high-iden-ratio HIGH_IDEN_RATIO]
[--high-repeat-ratio HIGH_REPEAT_RATIO]
[--low-repeat-ratio LOW_REPEAT_RATIO]
[--cano-motif {GT/AG,all}] [--bsj-xid BSJ_XID]
[--key-bsj-xid KEY_BSJ_XID] [--min-circ-dis MIN_CIRC_DIS]
[--rescue-low] [--fsj-xid FSJ_XID] [--key-fsj-xid KEY_FSJ_XID]
[--Alu ALU] [--flank-len FLANK_LEN] [--all-repeat ALL_REPEAT]
long.fa/fq ref.fa anno.gtf circRNA.bed/gtf out_dir
isocirc: circular RNA profiling and analysis using long-read sequencing
positional arguments:
long.fa/fq Long-read sequencing data generated with isoCirc
ref.fa Reference genome sequence file
anno.gtf Gene annotation file in GTF format
circRNA.bed/gtf circRNA database annotation file in BED or GTF format
Use ',' to separate multiple circRNA annotation files
out_dir Output directory for final result and temporary files
optional arguments:
-h, --help Show this help message and exit
-v, --version Show program's version number and exit
General options:
-t THREADS, --threads THREADS
Number of threads to use (default: 8)
--bedtools BEDTOOLS Path to bedtools (default: bedtools)
--minimap2 MINIMAP2 Path to minimap2 (default: minimap2)
Hybrid error-correction with short-read data (LoRDEC):
--short-read short.fa/fq
Short-read data for error correction
Use ',' to connect multiple or paired-end short-read data
(default: )
--lordec LORDEC Path to lordec-correct (default: lordec-correct)
--kmer KMER k-mer size (default: 21)
--solid SOLID Solid k-mer abundance threshold (default: 3)
Consensus calling with Tandem Repeats Finder (TRF):
--trf TRF Path to TRF program (default: trf409.legacylinux64)
--match MATCH Match score (default: 2)
--mismatch MISMATCH Mismatch penalty (default: 7)
--indel INDEL Indel penalty (default: 7)
--match-frac MATCH_FRAC
Match probability (default: 80)
--indel-frac INDEL_FRAC
Indel probability (default: 10)
--min-score MIN_SCORE
Minimum alignment score to report (default: 100)
--max-period MAX_PERIOD
Maximum period size to report (default: 2000)
Filtering and mapping of consensus sequences (minimap2):
--min-len MIN_LEN Minimum consensus length to keep (default: 30)
--min-copy MIN_COPY Minimum copy number of consensus to keep
(default: 2.0)
--min-frac MIN_FRAC Minimum fraction of original long read to keep
(default: 0.0)
--high-max-ratio HIGH_MAX_RATIO
Maximum mappedLen / consLen ratio for high-quality
alignment (default: 1.1)
--high-min-ratio HIGH_MIN_RATIO
Minimum mappedLen / consLen ratio for high-quality
alignment (default: 0.9)
--high-iden-ratio HIGH_IDEN_RATIO
Minimum identicalBases / consLen ratio for high-quality
alignment (default: 0.75)
--high-repeat-ratio HIGH_REPEAT_RATIO
Maximum mappedLen / consLen ratio for high-quality
self-tandem consensus (default: 0.6)
--low-repeat-ratio LOW_REPEAT_RATIO
Minimum mappedLen / consLen ratio for low-quality
self-tandem alignment (default: 1.9)
Identifying high-confidence BSJs and full-length circRNAs:
--cano-motif {GT/AG,all}
Canonical back-splice motif (GT/AG or all three
motifs: GT/AG, GC/AG, AT/AC) (default: GT/AG)
--bsj-xid BSJ_XID Maximum allowed mis/ins/del for 20-bp exonic sequence
flanking BSJ (10-bp each side) (default: 1)
--key-bsj-xid KEY_BSJ_XID
Maximum allowed mis/ins/del for 4-bp exonic sequence
flanking BSJ (2-bp each side) (default: 0)
--min-circ-dis MIN_CIRC_DIS
Minimum distance between genomic coordinates of
two back-splice sites (default: 150)
--rescue-low Use high-mapping quality reads to rescue low-mapping
quality reads (default: False)
--fsj-xid FSJ_XID Maximum allowed mis/ins/del for 20-bp exonic sequence
flanking FSJ (10-bp each side) (default: 1)
--key-fsj-xid KEY_FSJ_XID
Maximum allowed mis/ins/del for 4-bp exonic sequence
flanking FSJ (2-bp each side) (default: 0)
Other options:
--Alu ALU Alu repetitive element annotation in BED format
(default: )
--flank-len FLANK_LEN
Length of upstream and downstream flanking sequence to
search for Alu (default: 500)
--all-repeat ALL_REPEAT
All repetitive element annotation in BED format
(default: )
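To make the three `--high-*-ratio` thresholds concrete, the sketch below applies them to a single consensus alignment. It is an illustration under stated assumptions rather than isoCirc's actual code: the function name is hypothetical, the bounds are assumed to be inclusive, and the three conditions are assumed to be combined with a logical AND.

```python
def is_high_quality_alignment(mapped_len, identical_bases, cons_len,
                              high_max_ratio=1.1, high_min_ratio=0.9,
                              high_iden_ratio=0.75):
    """Apply the documented default thresholds to one consensus alignment.

    Assumption: bounds are inclusive and all three conditions must hold.
    """
    if cons_len <= 0:
        return False
    mapped_ratio = mapped_len / cons_len      # mappedLen / consLen
    iden_ratio = identical_bases / cons_len   # identicalBases / consLen
    return (high_min_ratio <= mapped_ratio <= high_max_ratio
            and iden_ratio >= high_iden_ratio)


# A 100-bp consensus mapped over 100 bp with 90 identical bases passes.
print(is_high_quality_alignment(100, 90, 100))
# Only half of the consensus is mapped, so the ratio 0.5 falls below 0.9.
print(is_high_quality_alignment(50, 50, 100))
```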
isoCirc takes long reads, each containing multiple copies of a circRNA sequence, as input.
It also requires a reference genome sequence and a gene annotation.
isoCirc outputs three result files in a user-specified directory:
No. | File name | Explanation |
---|---|---|
1 | isocirc.out | detailed information of each circRNA isoform with a high-confidence BSJ, in tabular format |
2 | isocirc.bed | bed12 format file of isocirc.out |
3 | isocirc_stats.out | basic summary statistics for isocirc.out |
isocirc.out is a 35-column tabular file; each line represents one unique circRNA isoform that has a high-confidence BSJ:
No. | Column name | Explanation |
---|---|---|
1 | isoformID | assigned isoform ID |
2 | chrom | chromosome ID |
3 | startCoor0based | start coordinate of circRNA, 0-based |
4 | endCoor | end coordinate of circRNA |
5 | geneStrand | gene strand (+/-) |
6 | geneID | gene ID |
7 | geneName | gene name |
8 | blockCount | number of blocks |
9 | blockSize | size of each block, comma-separated |
10 | blockStarts | relative start coordinates of each block, comma-separated; refer to the bed12 format for further details |
11 | refMapLen | total length of circRNA |
12 | blockType | category of each block. E: exon, I: intron, N: intergenic |
13 | blockAnno | detailed annotation for each block, in the format "TransID:E1(100)+I(50)+E2(30)", where TransID is the ID of the corresponding transcript and E1 and E2 are the 1st and 2nd exons of that transcript; multiple blocks are separated by commas, and multiple transcripts of one block are separated by semicolons |
14 | isKnownSS | True if SS is catalogued in gene annotation, False if not; comma-separated |
15 | isKnownFSJ | True if FSJ is catalogued in gene annotation, False if not; comma-separated |
16 | canoFSJMotif | strandedness and canonical motifs of FSJs, e.g., -GT/AG; NA if FSJ is not canonical; comma-separated |
17 | isHighFSJ | True if alignment around FSJ is high-quality, False if not; comma-separated |
18 | isKnownExon | True if block is a catalogued exon in gene annotation, False if not; comma-separated |
19 | isKnownBSJ | True if BSJ exists in circRNA annotation, False if not; values for multiple circRNA annotations are comma-separated |
20 | isCanoBSJ | True if BSJ has canonical motif (GT/AG), False if not |
21 | canoBSJMotif | strandedness and canonical motif of BSJ, e.g., -GT/AG; NA if BSJ is not canonical |
22 | isFullLength | True if isoform is considered a full-length isoform, False if not |
23 | BSJCate | category of BSJs: FSM/NIC/NNC, see explanation below |
24 | FSJCate | category of FSJs: FSM/NIC/NNC |
25 | CDS | number of bases that are mapped to CDS region |
26 | UTR | number of bases that are mapped to UTR region |
27 | lincRNA | number of bases that are mapped to lincRNA region |
28 | antisense | number of bases that are mapped to antisense region |
29 | rRNA | number of bases that are mapped to rRNA region |
30 | Alu | number of bases that are mapped to Alu element; NA if Alu annotation is not provided |
31 | allRepeat | number of bases that are mapped to all repeat elements; NA if repeat annotation is not provided |
32 | upFlankAlu | flanking Alu element in upstream; NA if none or Alu annotation is not provided |
33 | downFlankAlu | flanking Alu element in downstream; NA if none or Alu annotation is not provided |
34 | readCount | number of reads that come from this circRNA isoform |
35 | readIDs | IDs of reads that come from this circRNA isoform, comma-separated |
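The column layout above can be loaded into dictionaries with a few lines of Python. The sketch below is not shipped with isoCirc and makes several assumptions: fields are tab-separated, any header line either starts with `#` or begins with `isoformID`, and the helper name `read_isocirc_out` is hypothetical.

```python
ISOCIRC_OUT_COLUMNS = [
    "isoformID", "chrom", "startCoor0based", "endCoor", "geneStrand",
    "geneID", "geneName", "blockCount", "blockSize", "blockStarts",
    "refMapLen", "blockType", "blockAnno", "isKnownSS", "isKnownFSJ",
    "canoFSJMotif", "isHighFSJ", "isKnownExon", "isKnownBSJ", "isCanoBSJ",
    "canoBSJMotif", "isFullLength", "BSJCate", "FSJCate", "CDS", "UTR",
    "lincRNA", "antisense", "rRNA", "Alu", "allRepeat", "upFlankAlu",
    "downFlankAlu", "readCount", "readIDs",
]


def read_isocirc_out(lines):
    """Parse isocirc.out lines into dicts keyed by the documented column names.

    Assumptions: tab-separated fields; header lines start with '#' or with
    the literal first column name 'isoformID'.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields == [""] or fields[0].startswith("#") or fields[0] == "isoformID":
            continue
        if len(fields) != len(ISOCIRC_OUT_COLUMNS):
            raise ValueError(f"expected 35 columns, found {len(fields)}")
        row = dict(zip(ISOCIRC_OUT_COLUMNS, fields))
        row["readCount"] = int(row["readCount"])
        row["readIDs"] = row["readIDs"].split(",")
        rows.append(row)
    return rows


# Usage, assuming the toy run wrote its results to ./output:
#   with open("output/isocirc.out") as handle:
#       isoforms = read_isocirc_out(handle)
#   full_length = [r for r in isoforms if r["isFullLength"] == "True"]
```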
isocirc.bed is a bed12 format file; each line represents one unique circRNA isoform from isocirc.out:
No. | Column name | Explanation |
---|---|---|
1 | chrom | chromosome ID |
2 | startCoor0based | start coordinate of circRNA, 0-based |
3 | endCoor | end coordinate of circRNA |
4 | isoformName | name of the circRNA isoform |
5 | score | indicates how dark the peak will be displayed in the browser (0-1000), set as 0 by isoCirc |
6 | strand | +/- to denote strand |
7 | thickStart | the starting position at which the feature is drawn thickly, set as 0 by isoCirc |
8 | thickEnd | the ending position at which the feature is drawn thickly, set as 0 by isoCirc |
9 | itemRgb | an RGB value of the form R,G,B (e.g. 255,0,0), set as 0 by isoCirc |
10 | blockCount | number of blocks |
11 | blockSize | size of each block, comma-separated |
12 | blockStarts | relative start coordinates of each block, comma-separated; refer to the bed12 format for further details |
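Because block starts in bed12 are stored relative to the start coordinate in column 2, recovering the genomic interval of each block takes a small calculation. The following sketch shows it for one line; `bed12_blocks` is a hypothetical helper and the example record is made up for illustration, not taken from isoCirc output.

```python
def bed12_blocks(bed_line):
    """Return (chrom, start, end) for every block of one bed12 line.

    Coordinates are 0-based, half-open, as in the BED format.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start = fields[0], int(fields[1])
    # BED allows a trailing comma in the block lists, so skip empty items.
    sizes = [int(s) for s in fields[10].split(",") if s]
    rel_starts = [int(s) for s in fields[11].split(",") if s]
    return [(chrom, start + rel, start + rel + size)
            for rel, size in zip(rel_starts, sizes)]


# Made-up two-block record spanning chr16:1000-1500.
example = "chr16\t1000\t1500\tiso1\t0\t+\t0\t0\t0\t2\t100,150\t0,350"
print(bed12_blocks(example))
# prints: [('chr16', 1000, 1100), ('chr16', 1350, 1500)]
```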
isocirc_stats.out contains 27 basic statistics for isocirc.out:
No. | Name | Explanation |
---|---|---|
1 | Total reads | Number of raw reads in sample |
2 | Total reads with cons | Number of reads with consensus sequence called |
3 | Total mappable reads with cons | Number of reads with consensus sequence called, mappable to genome |
4 | Total reads with candidate BSJ | Number of reads with consensus sequence called, mappable to genome, and with BSJs ("candidate BSJs") |
5 | Total candidate BSJs | Number of candidate BSJs (circRNA species) |
6 | Total known candidate BSJs | Number of candidate BSJs reported in existing circRNA BSJ database (circBase / MiOncoCirc) |
7 | Total reads with high BSJs | Number of reads with consensus sequence called, mappable to genome, and with high-confidence BSJs (based on additional inspection of alignment around BSJs) |
8 | Total high BSJs | Number of high-confidence BSJs |
9 | Total known high BSJs | Number of high-confidence BSJs that are known |
10 | Total isoforms with high BSJs | Number of circRNA isoforms with high-confidence BSJs |
11 | Total isoforms with high BSJs high FSJs | Number of circRNA isoforms with high-confidence BSJs, and all FSJs are high-confidence (canonical, high-quality alignment around internal splice sites) |
12 | Total isoforms with high BSJ known SSs | Number of circRNA isoforms with high-confidence BSJs, and all SS are known (based on existing transcript GTF annotations for splice sites in linear RNA) |
13 | Total isoforms with high BSJs high FSJs known SSs | Number of circRNA isoforms with high-confidence BSJs, all FSJs are high-confidence, and all SS are known |
14 | Total full-length isoforms | Number of circRNA isoforms with high-confidence BSJs, and all FSJs are high-confidence or all SS are known |
15 | Total reads for full-length isoforms | Number of reads for circRNA isoforms with high-confidence BSJs, and all FSJs are high-confidence or all SS are known |
16 | Total full-length isoforms with FSM BSJ | Number of full-length circRNA isoforms with FSM BSJs |
17 | Total reads for full-length isoforms with FSM BSJ | Number of reads for full-length circRNA isoforms with FSM BSJs |
18 | Total full-length isoforms with NIC BSJ | Number of full-length circRNA isoforms with NIC BSJs |
19 | Total reads for full-length isoforms with NIC BSJ | Number of reads for full-length circRNA isoforms with NIC BSJs |
20 | Total full-length isoforms with NNC BSJ | Number of full-length circRNA isoforms with NNC BSJs |
21 | Total reads for full-length isoforms with NNC BSJ | Number of reads for full-length circRNA isoforms with NNC BSJs |
22 | Total full-length isoforms with FSM FSJ | Number of full-length circRNA isoforms with FSM FSJs |
23 | Total reads for full-length isoforms with FSM FSJ | Number of reads for full-length circRNA isoforms with FSM FSJs |
24 | Total full-length isoforms with NIC FSJ | Number of full-length circRNA isoforms with NIC FSJs |
25 | Total reads for full-length isoforms with NIC FSJ | Number of reads for full-length circRNA isoforms with NIC FSJs |
26 | Total full-length isoforms with NNC FSJ | Number of full-length circRNA isoforms with NNC FSJs |
27 | Total reads for full-length isoforms with NNC FSJ | Number of reads for full-length circRNA isoforms with NNC FSJs |
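Rows 14 and 15 describe a full-length isoform as one with a high-confidence BSJ whose FSJs are all high-confidence or whose splice sites are all known. The sketch below expresses that rule in terms of the comma-separated `isHighFSJ` and `isKnownSS` columns of isocirc.out. It is an illustration, not isoCirc's implementation: the function names are hypothetical, every row is assumed to already have a high-confidence BSJ, and an isoform with no FSJs (empty or `NA` values) is assumed to pass.

```python
def parse_flags(value):
    """Turn 'True,False' into [True, False]; empty or 'NA' items are skipped."""
    return [item == "True" for item in value.split(",") if item not in ("", "NA")]


def is_full_length(is_high_fsj, is_known_ss):
    """Full-length if all FSJs are high-confidence or all splice sites are known.

    Assumption: the isoform already has a high-confidence BSJ, as every row
    of isocirc.out does.
    """
    return all(parse_flags(is_high_fsj)) or all(parse_flags(is_known_ss))


print(is_full_length("True,True", "False,True"))   # all FSJs high-confidence
print(is_full_length("True,False", "True,True"))   # all splice sites known
print(is_full_length("True,False", "False,True"))  # neither condition holds
```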
- BSJ: Back-Splice Junction
- FSJ: Forward-Splice Junction
- FSS: Forward-Splice Site
- SS: Splice Site
- cons: consensus sequence
- cano: canonical
- high: high-confidence (canonical and high-quality alignment around FSJ/BSJ)
- FSM: Full Splice Match
- NIC: Novel In Catalog
- NNC: Novel Not in Catalog
With the result file generated by isocirc, we can visualize the circular alignment of full-length isoCirc reads. Let's use the toy example in test_data again:
isocircPlot ./read_toy.fa ./chr16_toy.fa ./chr16_toy.gtf ./output/isocirc.bed ./isocircPlot_toy.list SJ ./output
A .png file will be generated in the output folder, showing how the isoCirc long read is aligned to the reference genome multiple times.
Yan Gao [email protected]
Yi Xing [email protected]