microPIECE

microPIECE (microRNA pipeline enhanced by CLIP experiments) takes AGO-CLIP data from one species (speciesA) and transfers it to another species (speciesB). Given a set of miRNAs from speciesB, it then predicts their targets on the transferred CLIP regions.

The minimal workflow needs a genome file and its annotation file in GFF format for both speciesA and speciesB. In addition, speciesA needs at least one AGO-CLIP dataset, and speciesB needs a set of miRNAs for the target prediction. The full workflow additionally needs smallRNA-sequencing data, and a set of non-coding RNAs can be provided as a filter. The pipeline uses the smallRNA data to mine novel microRNAs and, if needed, to complete the given miRNA dataset. It further performs expression calculation, isoform detection, genomic loci identification and orthology determination.

Required Software

Required Perl modules

Installation

Please install the dependencies and run

git clone -b v1.5.2 https://github.com/microPIECE-team/microPIECE.git

or download the latest release as *.tar.gz or *.zip file:

curl -L -o microPIECE_v1.5.2.tar.gz https://github.com/microPIECE-team/microPIECE/archive/v1.5.2.tar.gz
# or
curl -L -o microPIECE_v1.5.2.zip https://github.com/microPIECE-team/microPIECE/archive/v1.5.2.zip

Docker

We also provide microPIECE as a Docker image. We tested the image on Ubuntu, Debian and macOS. On macOS, the Piranha command make test fails during the build, although the test succeeds when it is run inside the container. Therefore, we temporarily excluded this statement.

Information about the docker images:

Branch Size Layers Comment
Latest release v1.5.2
docker pull micropiece/micropiece:v1.5.2
git clone https://github.com/microPIECE-team/microPIECE-testset.git testset
docker run -it --rm -v $PWD:/data micropiece/micropiece:v1.5.2 microPIECE.pl   \
  --genomeA testset/NC_035109.1_reduced_AAE_genome.fa  \
  --genomeB testset/NC_007416.3_reduced_TCA_genome.fa   \
  --annotationA testset/NC_035109.1_reduced_AAE_genome.gff   \
  --annotationB testset/NC_007416.3_reduced_TCA_genome.gff   \
  --clip testset/SRR5163632_aae_clip_reduced.fastq,testset/SRR5163633_aae_clip_reduced.fastq,testset/SRR5163634_aae_clip_reduced.fastq   \
  --clip testset/SRR5163635_aae_clip_reduced.fastq,testset/SRR5163636_aae_clip_reduced.fastq,testset/SRR5163637_aae_clip_reduced.fastq --adapterclip GTGTCAGTCACTTCCAGCGG  \
  --overwrite \
  --smallrnaseq a=testset/tca_smallRNAseq_rna_contaminated.fastq \
  --adaptersmallrnaseq3=TGGAATTCTCGGGTGCCAAGG \
  --adaptersmallrnaseq5 GTTCAGAGTTCTACAGTCCGACGATC \
  --filterncrnas testset/TCA_all_ncRNA_but_miR.fa \
  --speciesB tca 2>&1 | tee out.log

Usage

Input data

  • minimal workflow
    • speciesA genome
    • speciesA GFF
    • speciesA AGO-CLIP-sequencing library/libraries
    • speciesB genome
    • speciesB GFF
    • speciesB microRNA set (mature)
  • full workflow (in addition to the minimal workflow)
    • speciesB non-codingRNA set (without miRNAs)
    • speciesB microRNA set (precursor)
    • speciesB smallRNA-sequencing library/libraries

PARAMETERS

  • --version|-V

    version of this pipeline

  • --help|-h

    prints a helpful help message

  • --genomeA and --genomeB

    Genome of the species with the CLIP data (species A, --genomeA) and the genome of the species where we want to predict the miRNA targets (species B, --genomeB)

  • --gffA and --gffB

    Genome feature file (GFF) of the species with the CLIP data (species A, --gffA) and the GFF of the species where we want to predict the miRNA targets (species B, --gffB)

  • --clip

    Comma-separated CLIP-seq FASTQ files. The option can also be given multiple times. Format:

      --clip con1_rep1_clip.fq,con1_rep2_clip.fq,con2_clip.fq
      # OR
      --clip con1_rep1_clip.fq --clip con1_rep2_clip.fq --clip con2_clip.fq
    
  • --adapterclip

    Sequencing-adapter of CLIP reads

  • --smallrnaseq

    Comma-separated smallRNA-seq FASTQ files, prefixed with 'condition='. The option can also be given multiple times. Format:

      --smallrnaseq con1=A.fastq,B.fastq --smallrnaseq con2=C.fq
      # OR
      --smallrnaseq con1=A.fastq --smallrnaseq con1=B.fastq --smallrnaseq con2=C.fq
    
  • --adaptersmallrnaseq5 and --adaptersmallrnaseq3

    5' adapter of smallRNA-seq reads (--adaptersmallrnaseq5) and for 3' end (--adaptersmallrnaseq3)

  • --filterncrnas

    Multi-FASTA file of ncRNAs used to filter smallRNA-seq reads. The file must not contain miRNAs.

  • --threads

    Number of threads to be used

  • --overwrite

    set this parameter to overwrite existing files

  • --testrun

    sets this pipeline to test mode (accounting for the small testset in Piranha). This option should not be used in a real analysis!

  • --out

    output folder

  • --mirna

    miRNA set. If set, mining is disabled and this set is used for prediction.

  • --speciesBtag

    Three letter code of species where we want to predict the miRNA targets (species B, --speciesBtag).

  • --mirbasedir

    The folder specified by --mirbasedir is searched for the files organisms.txt.gz, mature.fa.gz, and hairpin.fa.gz. If the files do not exist, they will be downloaded.

  • --tempdir

    The folder specified by --tempdir is used for temporary files. The default value is tmp/ inside the output folder specified by the --out parameter.

  • --piranhabinsize

    Sets the Piranha bin size and has a default value of 30.

  • --CLIPminProcessLength and --CLIPmaxProcessLength

    Both are integer values and set the lower and upper limit for the processed peak length. Peaks having a width below --CLIPminProcessLength or above --CLIPmaxProcessLength are ignored. Default values are 22 for --CLIPminProcessLength and 50 for --CLIPmaxProcessLength.

  • --CLIPminlength

    An integer value specifying the minimal length of a CLIP peak to be processed. Default value is 0, meaning no minimal length for CLIP peaks.
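
The option formats described above can be illustrated with a short Python sketch. It is not part of microPIECE (which is written in Perl) and is not the pipeline's own option parser; the helper names split_files, group_by_condition and keep_peak are hypothetical. The sketch assumes that a peak width is computed as end minus start, as for BED-style half-open coordinates, and uses the documented defaults of 22 and 50 for the processed peak length.

```python
from collections import defaultdict


def split_files(values):
    """Flatten repeated, comma-separated option values into one file list."""
    files = []
    for value in values:
        files.extend(part for part in value.split(",") if part)
    return files


def group_by_condition(values):
    """Group 'condition=file1,file2' values by their condition name."""
    groups = defaultdict(list)
    for value in values:
        condition, sep, file_list = value.partition("=")
        if not sep or not condition:
            raise ValueError(f"expected 'condition=files', got {value!r}")
        groups[condition].extend(split_files([file_list]))
    return dict(groups)


def keep_peak(start, end, min_len=22, max_len=50):
    """Return True if the peak width lies within [min_len, max_len]."""
    width = end - start  # assumption: BED-style half-open coordinates
    return min_len <= width <= max_len


# Both ways of writing --clip yield the same list of files.
print(split_files(["con1_rep1_clip.fq,con1_rep2_clip.fq,con2_clip.fq"]))
print(split_files(["con1_rep1_clip.fq", "con1_rep2_clip.fq", "con2_clip.fq"]))

# Both ways of writing --smallrnaseq yield the same grouping.
print(group_by_condition(["con1=A.fastq,B.fastq", "con2=C.fq"]))
print(group_by_condition(["con1=A.fastq", "con1=B.fastq", "con2=C.fq"]))
```

Under these assumptions, a peak spanning positions 100 to 130 (width 30) would be kept, while a peak spanning 100 to 121 (width 21) would be ignored.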

OUTPUT

  • pseudo mirBASE dat file: final_mirbase_pseudofile.dat

    A pseudo mirBASE dat file containing all precursor sequences with their named mature sequences and their coordinates. It contains only the fields:

    • ID
    • FH and FT
    • SQ
  • mature miRNA set: mature_combined_mirbase_novel.fa

    mature microRNA set, containing novels and miRBase-completed (if mined), together with the known miRNAs from miRBase

  • precursor miRNA set: hairpin_combined_mirbase_novel.fa

    precursor microRNA set, containing novels (if mined), together with the known miRNAs from miRBase

  • mature miRNA expression per condition: miRNA_expression.csv

    Semicolon-separated file containing:

      1. rpm
      2. condition
      3. miRNA
  • orthologous prediction file: miRNA_orthologs.csv

    tab-separated file containing:

      1. query_id
      2. subject_id
      3. identity
      4. alignment length
      5. number mismatches
      6. number gap openings
      7. start position inside query
      8. end position inside query
      9. start position inside subject
      10. end position inside subject
      11. evalue
      12. bitscore
      13. aligned query sequence
      14. aligned subject sequence
      15. length query sequence
      16. length subject sequence
      17. coverage for query sequence
      18. coverage for subject sequence
  • miRDeep2 mining result in HTML/CSV: mirdeep_output.html/csv

    the standard output HTML/CSV file of miRDeep2

  • ISOMIR prediction files: isomir_output_CONDITION.csv

    semicolon-delimited file containing:

      1. mirna
      2. substitutions
      3. added nucleotides on 3' end
      4. nucleotides at 5' end different from the annotated sequence
      5. nucleotides at 3' end different from the annotated sequence
      6. sequence
      7. rpm
      8. condition
  • genomic location of miRNAs: miRNA_genomic_position.csv

    tab-delimited file containing:

      1. miRNA
      2. genomic contig
      3. identity
      4. length
      5. miRNA-length
      6. number mismatches
      7. number gapopens
      8. miRNA-start
      9. miRNA-stop
      10. genomic-start
      11. genomic-stop
      12. evalue
      13. bitscore
  • all library support-level target predictions: *_miranda_output.txt

    miRanda output, reduced to the lines starting with >

  • all library support-level CLIP transfer .bed files: *transfered_merged.bed

    BED file of the transferred CLIP regions in the speciesB transcriptome
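
The column layouts listed above can be read with Python's standard csv module. The following is a minimal sketch, not part of microPIECE: the function names read_expression and read_orthologs and the short column labels in ORTHOLOG_COLUMNS are hypothetical. It assumes that miRNA_orthologs.csv has no header line, and it skips any line of miRNA_expression.csv whose first field is not a number (which would also skip a header line, if one is present).

```python
import csv
from collections import defaultdict

# Hypothetical short labels for the 18 documented ortholog columns, in order.
ORTHOLOG_COLUMNS = [
    "query_id", "subject_id", "identity", "alignment_length",
    "mismatches", "gap_openings", "query_start", "query_end",
    "subject_start", "subject_end", "evalue", "bitscore",
    "aligned_query", "aligned_subject", "query_length",
    "subject_length", "query_coverage", "subject_coverage",
]


def read_expression(path):
    """Return {miRNA: {condition: rpm}} from the semicolon-separated file."""
    table = defaultdict(dict)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=";"):
            if len(row) < 3:
                continue  # ignore empty or incomplete lines
            try:
                rpm = float(row[0])
            except ValueError:
                continue  # first field is not numeric, e.g. a header line
            condition, mirna = row[1], row[2]
            table[mirna][condition] = rpm
    return dict(table)


def read_orthologs(path):
    """Return one dict per line of the tab-separated ortholog file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(
            handle, fieldnames=ORTHOLOG_COLUMNS, delimiter="\t"
        )
        return list(reader)
```

A call such as read_expression("miRNA_expression.csv") returns a nested dictionary that can be indexed first by miRNA and then by condition. All values returned by read_orthologs are strings and need to be converted before numeric comparison.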

Example

Testset

Feel free to test the pipeline with our microPIECE-testset:

git clone https://github.com/microPIECE-team/microPIECE-testset.git

Alternative

CAVEATS

A complete list of open issues is available on GitHub Issues.

Please report any new issues as a new GitHub issue.

Changelog

  • scheduled for next release

    No features planned

  • v1.5.2 (2018-04-13)

    Refactoring of CLIP_merge_bed_files.pl to reduce memory footprint by a factor of 10x (Fixes #174)

    Refactoring of Piranha run to support multithreading (Fixes #177)

    Fixing copy process of final files (Fixes #184)

    Setting default bin size for Piranha to 30 (Fixes #178)

    This version is archived as DOI.
  • v1.5.1 (2018-04-11)

    Added optimized pre-binning step for Piranha (Fixes #132)

    This version is archived as DOI.
  • v1.5.0 (2018-04-10)

    Removing additional length cutoff during CLIP transfer (Fixes #153)

    Add command line options --CLIPminProcessLength, --CLIPmaxProcessLength, and --CLIPminlength for length limits used in run_CLIP_process and run_CLIP_clip_mapper steps enabling processing of peaks with user defined widths (Fixes #145)

    Dynamic naming of output files based on minlength variable in run_CLIP_clip_mapper (Fixes #146)

    Correct calculation of length of a bed feature and moving scripts/CLIP_bedtool_discard_sizes.pl into lib/microPIECE.pm (Fixes #147)

    Add an optimized pre-binning step with pseudocounts for bins covered by an exon as preparation for Piranha (Fixes #132 and #155)

    This version is archived as DOI.

    This version was accepted by The Journal of Open Source Software (Review issue #616)

  • v1.4.0 (2018-03-31)

    Copying pseudo mirBASE dat file final_mirbase_pseudofile.dat into output folder (Fixes #131)

    Corrected RNA::HairpinFigure output (Fixes #137)

    Fix the requirement of an accession inside mirBASE dat file (Fixes #134)

    Avoiding error message while copying the out file for genomic location into base folder (Fixes #117)

  • v1.3.0 (2018-03-29)

    Creating all structures on the fly using pseudo-mirBASE-dat as input.

    Using miRNA.dat from mirBASE as source for mature/precursor sequence and relationship (Fixes #127)

    Fix of division-by-zero bug for empty mapping files (Fixes #118)

    Fix of typo in --piranhabinsize option (Fixes #116)

  • v1.2.3 (2018-03-26)

    Fix transformation of precursor sequences based on mirbase #22 precursor sequences with a single mature. (Fixes #109)

  • v1.2.2 (2018-03-23)

    Improved collision detection for newly identified miRNAs, avoiding crashes caused by genomic copies. (Fixes #105)

  • v1.2.1 (2018-03-23)

    Enables stable numbering for newly identified miRNAs based on their precursor and mature sequences (Fixes #101)

  • v1.2.0 (2018-03-22)

    We are using miraligner, which requires Java version 1.7, but 1.8 was installed by default. This was fixed by switching to v1.4 of the docker base image. Additionally, miraligner requires fixed filenames for its databases. Therefore, version v1.2.0 solves miraligner-related bugs and re-enables the isomir detection. (Fixes #97 and #98)

  • v1.1.0 (2018-03-12)

    Add isomir detection and copy the final genomic location file to the output folder (Fixes #34)

  • v1.0.7 (2018-03-08)

    Piranha was lacking a bin_size parameter. Added parameter --piranahbinsize with a default value of 20 (Fixes #66)

  • v1.0.6 (2018-03-08)

    Added parameter --mirbasedir and --tempdir to support local mirbase files and relocation of directory for temporary files (Fixes #66, #73, and #76)

  • v1.0.5 (2018-03-07)

    Update of documentation and correct spelling of --mirna parameter

  • v1.0.4 (2018-03-07)

    Fixes complete mature in final output (Fixes #69)

  • v1.0.3 (2018-03-06)

    Add tests for perl scripts in script folder which ensure the correct handling of BED stop coordinates (Fixes #65)

  • v1.0.2 (2018-03-05)

    Fixes the incorrect sorting of BED files; the result was correct, but sorting was performed in the wrong order. (Fixes #63)

  • v1.0.1 (2018-03-05)

    Fix an error concerning BED file handling of start and stop coordinates. (Fixes #59)

  • v1.0.0 (2018-03-05)

    is archived as DOI and submitted to The Journal of Open Source Software.
  • v0.9.0 (2018-03-05)

    first version archived at Zenodo with the DOI

License

This program is released under GPLv2. For further license information, see LICENSE.md shipped with this program. Copyright (c) 2018 Daniel Amsel and Frank Förster (employees of Fraunhofer Institute for Molecular Biology and Applied Ecology IME). All rights reserved.

AUTHORS

SEE ALSO

  • Project source code on GitHub
  • Docker image on DockerHub
  • Travis continuous integration page
  • Test coverage reports