Merge pull request #19 from sjanssen2/classify-functions

Classify functions
qiime2 · Nov 23, 2017 · 2c59d6b · 2c59d6b
2 parents 7121732 + 0324829
commit 2c59d6b
Show file tree

Hide file tree

Showing 16 changed files with 331 additions and 49 deletions.
diff --git a/Example/insertion-taxonomy.qza b/Example/insertion-taxonomy.qza
diff --git a/README.md b/README.md
@@ -14,18 +14,18 @@ Once QIIME2 is [installed](https://docs.qiime2.org/2017.10/install/native/), and
 Beta diversity was computed for all 599 samples of [this study](https://qiita.ucsd.edu/study/description/10422) (manuscript in preparation) on the deblur table rarefied to 5,870 sequences per sample with 4,727 sOTUs total as unweighted unifrac distance with three alternative phylogenetic trees:
 
   A) De-novo by aligning 249nt long fragments via mafft and inferring a tree via fasttree - as suggested in the QIIME 2 "moving pictures" [tutorial](https://docs.qiime2.org/2017.10/tutorials/moving-pictures/#generate-a-tree-for-phylogenetic-diversity-analyses). Strong separation between observed clusters cannot be explained by any metadata, but the relative abundance of three sOTUs belonging to the genus *Methanobrevibacter*: not detectable in lower gray cluster, very low abundant in upper coloured cluster.
-  
+
   B) Mean path length from root to tips in the denovo tree is 0.94, while the lowest common ancestor for the three *Methanobrevibacter* sOTUs has an outstanding length of 1.43. Manually shortening the grandparent's branch length from 0.82 to 0.4 re-unites clusters.
 
   C) Inserting denovo fragments into a well curated reference phylogeny via the fragment insertion plugin also resolves cluster separation but does not require any manual manipulation.
-  
+
 Note: the same effects are observed when the sOTU table is not rarefied.
 
 ### Fragment insertion enables meta-analyses across different variable 16S regions and fragment length.
 
 <img src="Example/metaanalysis.png">
 
-Meta-analyses of two microbiome studies with heterogeneous variable 16S regions. 
+Meta-analyses of two microbiome studies with heterogeneous variable 16S regions.
 Both studies sampled the same three body products: [Study 'Family'](https://qiita.ucsd.edu/study/description/797) contains 854 human and 217 dog samples with 37,181 sOTUs of the first 128nt from V2 [[Song et al.]](http://dx.doi.org/10.7554/eLife.00458), while [study 'Yanomani'](https://qiita.ucsd.edu/study/description/10052) comprises 66 samples of uncontacted Amerindians in Venezuela with 17,249 sOTUs of the first 150nt of V4 [[Clemente et al.]](http://dx.doi.org/10.1126/sciadv.1500183).
 Beta diversity was computed on one non rarefied deblur table combining both studies as unweighted UniFrac distance.
 
@@ -46,10 +46,9 @@ A fragment may be reasonable to insert into multiple locations. However, downstr
 
 ## Files produced
 
-The plugin will generate three files:
+The plugin will generate two files:
   1. A `Phylogeny[Rooted]` type: This is the tree with the sequences placed (which could be inserted), and are identified by their corresponding sequence IDs. You can directly use this tree for phylogenetic diversity computation like UniFrac or Faith's Phylogenetic Diversity.
   2. A `Placements` type: It is a JSON object which, for every input sequence, describes the different possible placements.
-  3. And last a `FeatureData[Taxonomy]` type: This is a table that holds a taxonomic lineage string for every fragment inserted into the tree. The lineage is obtained by traversing the tree from the fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, taxonomy is only as good as provided reference phylogeny. Note, taxonomic labels are identified by containing two underscore characters `_` `_` as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this file, since it has not been compared to existing taxonomic assignment methods. Particularly since the default reference tree is not inline with the reference taxonomy.**
 
 ## Example
 
@@ -59,21 +58,35 @@ Let us use the `FeatureData[Sequence]` from QIIME's tutorial as our input:
 
    - `rep-seqs.qza`: [view](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2017.10%2Fdata%2Ftutorials%2Fmoving-pictures%2Frep-seqs.qza) | [download](https://docs.qiime2.org/2017.10/data/tutorials/moving-pictures/rep-seqs.qza)
 
-The following single command will produce three outputs: 1) `phylogeny.qza` is the `Phylogeny[Rooted]`, 2) `placements.qza` provides placement distributions for the fragments (you will most likely ignore this output) and 3) `classification.qza` which is a taxonomic classification for every fragment that has been inserted into the reference phylogeny and is of the type `FeatureData[Taxonomy]` (Computation might take some 10 minutes):
+The following single command will produce two outputs: 1) `phylogeny.qza` is the `Phylogeny[Rooted]` and 2) `placements.qza` provides placement distributions for the fragments (you will most likely ignore this output) (Computation might take some 10 minutes):
 ```
 qiime fragment-insertion sepp-16s-greengenes \
   --i-representative-sequences rep-seqs.qza \
   --o-tree insertion-tree.qza \
-  --o-placements insertion-placements.qza \
-  --o-classification insertion-taxonomy.qza
+  --o-placements insertion-placements.qza
 ```
 Output artifacts:
    - `insertion-tree.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-tree.qza?raw=true)
    - `insertion-placements.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-placements.qza?raw=true)
-   - `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true)
 
 You can then use `insertion-tree.qza` for all downstream analyses, e.g. "Alpha and beta diversity analysis", instead of `rooted-tree.qza`.
 
+### Assign taxonomy
+
+The *fragment-insertion* plugin provides an experimental method to assign a taxonomic lineage to every fragment. Assume the tips of your reference phylogeny are e.g. OTU-IDs from Greengenes (which is the case when you use the default reference). If you have a taxonomic mapping for every OTU-ID to a lineage string, as provided by Greengenes, function `classify_otus-experimental` will detect the closest OTU-IDs for every fragment in the insertion tree and report this OTU-IDs lineage string for the fragment. Thus, the function expects two required input artifacts: 1) the representative-sequences of type `FeatureData[Sequence]` and 2) the resulting tree of a previous `sepp` run which is of type `Phylogeny[Rooted]`. For the example, we also specify a third, optional input [taxonomy_gg99.qza](https://raw.githubusercontent.com/biocore/q2-fragment-insertion/master/taxonomy_gg99.qza) of type `FeatureData[Taxonomy]`.
+
+    qiime fragment-insertion classify_otus-experimental \
+      --i-representative-sequences rep-seqs.qza \
+      --i-tree insertion-tree.qza \
+      --i-reference-taxonomy taxonomy_gg99.qza \
+      --o-classification taxonomy.qza
+
+Output artifacts:
+   - `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true)
+
+You need to make sure, that the `--i-reference-taxonomy` matches the reference phylogeny used with function `sepp`.
+This method is experimental as of Nov. 22nd 2017, since we have not yet evaluated the quality / correctness of the returned lineages. Use on your own risk!
+
 ### Import representative sequencs into QIIME 2 artifact
 
 Assume you have a collection of representative sequences as a multiple fasta file, e.g. from downloading a `reference-hit.seqs.fa` Qiita file. You can *import* this file into a QIIME 2 artifact with the following command:
@@ -104,3 +117,13 @@ Upload the newly created conda package to biocore:
     anaconda upload -u biocore q2-fragment-insertion-0.1.0-py35h3e8d850_1.tar.bz2
 
 Remember to do that for both, Linux and OSX.
+
+## How to import taxonomy tables
+
+The plugin function `classify_otus-experimental` allows to pass in *reference taxonomic table* via argument `--i-reference-taxonomy`. You can import a tab-separated two-column table where first column is the OTU-ID and the second column is the ";" separated lineage string via the following command as an QIIME 2 artifact. Make sure your table does **not** contain a header line:
+
+    qiime tools import \
+    --input-path taxonomy.tsv \
+    --source-format HeaderlessTSVTaxonomyFormat \
+    --type "FeatureData[Taxonomy]" \
+    --output-path foo.qza
diff --git a/ci/recipe/meta.yaml b/ci/recipe/meta.yaml
@@ -1,4 +1,4 @@
-{% set version = '0.2.0' %}
+{% set version = '0.3.0' %}
 {% set qiime2release = '2017.10' %}
 
 package:

diff --git a/q2_fragment_insertion/__init__.py b/q2_fragment_insertion/__init__.py
@@ -7,8 +7,8 @@
 # ----------------------------------------------------------------------------
 import pkg_resources
 
-from ._insertion import sepp
+from ._insertion import sepp, classify_paths, classify_otus_experimental
 
 
 __version__ = pkg_resources.get_distribution('q2-fragment-insertion').version
-__all__ = ['sepp']
+__all__ = ['sepp', 'classify_paths', 'classify_otus_experimental']
diff --git a/q2_fragment_insertion/_insertion.py b/q2_fragment_insertion/_insertion.py
@@ -7,6 +7,7 @@
 # ----------------------------------------------------------------------------
 
 import os
+import sys
 import shutil
 import tempfile
 import subprocess
@@ -21,6 +22,7 @@
                                    AlignedDNASequencesDirectoryFormat,
                                    AlignedDNAIterator,
                                    AlignedDNAFASTAFormat)
+from qiime2.sdk import Artifact
 from q2_types.tree import NewickFormat
 
 from q2_fragment_insertion._format import PlacementsFormat
@@ -95,7 +97,7 @@ def _obtain_taxonomy(filename_tree: str,
                      DNASequencesDirectoryFormat) -> pd.DataFrame:
     """Buttom up traverse tree for nodes that are inserted fragments and
        collect taxonomic labels upon traversal."""
-    tree = skbio.TreeNode.read(filename_tree)
+    tree = skbio.TreeNode.read(str(filename_tree))
     taxonomy = []
     for fragment in representative_sequences.file.view(DNAIterator):
         lineage = []
@@ -108,7 +110,13 @@ def _obtain_taxonomy(filename_tree: str,
             lineage_str = np.nan
         taxonomy.append({'Feature ID': fragment.metadata['id'],
                          'Taxon': lineage_str})
-    return pd.DataFrame(taxonomy).set_index('Feature ID')
+    pd_taxonomy = pd.DataFrame(taxonomy).set_index('Feature ID')
+    if pd_taxonomy['Taxon'].dropna().shape[0] == 0:
+        raise ValueError(
+            ("None of the representative-sequences can be found in the "
+             "insertion tree. Please double check that both inputs match up, "
+             "i.e. are results from the same 'sepp' run."))
+    return pd_taxonomy
 
 
 def _sepp_path():
@@ -135,7 +143,7 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
          threads: int=1,
          reference_alignment: AlignedDNASequencesDirectoryFormat=None,
          reference_phylogeny: NewickFormat=None
-         ) -> (NewickFormat, PlacementsFormat, pd.DataFrame):
+         ) -> (NewickFormat, PlacementsFormat):
 
     _sanity()
     # check if sequences and tips in reference match
@@ -150,7 +158,6 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
 
     placements_result = PlacementsFormat()
     tree_result = NewickFormat()
-    taxonomy = pd.DataFrame()
 
     with tempfile.TemporaryDirectory() as tmp:
         _run(str(representative_sequences.file.view(DNAFASTAFormat)),
@@ -160,9 +167,130 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
         outplacements = os.path.join(tmp, placements)
 
         _add_missing_branch_length(outtree)
-        taxonomy = _obtain_taxonomy(outtree, representative_sequences)
 
         shutil.copyfile(outtree, str(tree_result))
         shutil.copyfile(outplacements, str(placements_result))
 
-    return tree_result, placements_result, taxonomy
+    return tree_result, placements_result
+
+
+def classify_paths(representative_sequences: DNASequencesDirectoryFormat,
+                   tree: NewickFormat) -> pd.DataFrame:
+    return _obtain_taxonomy(str(tree), representative_sequences)
+
+
+def classify_otus_experimental(
+        representative_sequences: DNASequencesDirectoryFormat,
+        tree: NewickFormat,
+        reference_taxonomy: pd.DataFrame=None) -> pd.DataFrame:
+    if reference_taxonomy is None:
+        filename_default_taxonomy = resource_filename(
+            Requirement.parse('q2_fragment_insertion'),
+            'q2_fragment_insertion/assets/taxonomy_gg99.qza')
+        reference_taxonomy = Artifact.load(
+            filename_default_taxonomy).view(pd.DataFrame)
+
+    # convert type of feature IDs to str (depending on pandas type inference
+    # they might come as integers), to make sure they are of the same type as
+    # in the tree.
+    reference_taxonomy.index = map(str, reference_taxonomy.index)
+
+    # load the insertion tree
+    tree = skbio.TreeNode.read(str(tree))
+
+    # ensure that all reference tips in the tree (those without the inserted
+    # fragments) have a mapping in the user provided taxonomy table
+    names_tips = {node.name for node in tree.tips()}
+    names_fragments = {fragment.metadata['id']
+                       for fragment
+                       in representative_sequences.file.view(DNAIterator)}
+    missing_features = (names_tips - names_fragments) -\
+        set(reference_taxonomy.index)
+    if len(missing_features) > 0:
+        # QIIME2 users can run with --verbose and see stderr and stdout.
+        # Thus, we here report more details about the mismatch:
+        sys.stderr.write(
+            ("The taxonomy artifact you provided does not contain lineage "
+             "information for the following %i features:\n%s") %
+            (len(missing_features), "\n".join(missing_features)))
+        raise ValueError("Not all OTUs in the provided insertion tree have "
+                         "mappings in the provided reference taxonomy.")
+
+    taxonomy = []
+    for fragment in representative_sequences.file.view(DNAIterator):
+        # for every inserted fragment we now try to find the closest OTU tip
+        # in the tree and available mapping from the OTU-ID to a lineage
+        # string:
+        lineage_str = np.nan
+        # first, let us check if the fragment has been inserted at all ...
+        try:
+            curr_node = tree.find(fragment.metadata['id'])
+        except skbio.tree.MissingNodeError:
+            continue
+        # if yes, we start from the inserted node and traverse the tree as less
+        # as possible towards the root and check at every level if one or
+        # several OTU-tips are within the sub-tree.
+        if curr_node is not None:
+            foundOTUs = []
+            # Traversal is stopped at a certain level, if one or more OTU-tips
+            # have been found in the sub-tree OR ... (see break below)
+            while len(foundOTUs) == 0:
+                # SEPP insertion - especially for multiple very similar
+                # sequences - can result in a rather complex topology change
+                # if all those sequences are inserted into the same branch
+                # leading to one OTU-tip. Thus, we cannot simply visit only
+                # all siblings or decendents and rather need to traverse the
+                # whole sub-tree. Average case should be well behaved,
+                # thus I think it is ok.
+                for node in curr_node.postorder():
+                    if (node.name is not None) and \
+                       (node.name in reference_taxonomy.index):
+                        # if a suitable OTU-tip node is found AND this OTU-ID
+                        # has a mapping in the user provided reference_taxonomy
+                        # we store the OTU-ID in the growing result list
+                        foundOTUs.append(node.name)
+                # ... if the whole tree has been traversed without success,
+                # e.g. if user provided reference_taxonomy did not contain any
+                # matching OTU-IDs.
+                if curr_node.is_root():
+                    break
+                # prepare next while iteration, by changing to the parent node
+                curr_node = curr_node.parent
+
+            if len(foundOTUs) > 0:
+                # If the above method has identified exactly one OTU-tip,
+                # resulting lineage string would simple be the one provided by
+                # the user reference_taxonomy. However, if the inserted
+                # fragment cannot unambiguously places into the reference tree,
+                # the above method will find multiple OTU-IDs, which might have
+                # lineage strings in the user provided reference_taxonomy that
+                # are similar up to a certain rank and differ e.g. for genus
+                # and species.
+                # Thus, we here find the longest common prefix of all lineage
+                # strings. We don't operate per character, but per taxonomic
+                # rank. Therefore, we first "convert" every lineage sting into
+                # a list of taxa, one per rank.
+                split_lineages = []
+                for otu in foundOTUs:
+                    # find lineage string for OTU
+                    lineage = reference_taxonomy.loc[otu, 'Taxon']
+                    # necessary to split lineage apart to ensure that
+                    # the longest common prefix operates on atomic ranks
+                    # instead of characters
+                    split_lineages.append(list(
+                        map(str.strip, lineage.split(';'))))
+                # find the longest common prefix rank-wise and concatenate to
+                # one lineage string, separated by ;
+                lineage_str = "; ".join(os.path.commonprefix(split_lineages))
+            taxonomy.append({'Feature ID': fragment.metadata['id'],
+                             'Taxon': lineage_str})
+    pd_taxonomy = pd.DataFrame(taxonomy)
+    # test if dataframe is completely empty, or if no lineages could be found
+    if (len(taxonomy) == 0) or \
+       (pd_taxonomy['Taxon'].dropna().shape[0] == 0):
+        raise ValueError(
+            ("None of the representative-sequences can be found in the "
+             "insertion tree. Please double check that both inputs match up, "
+             "i.e. are results from the same 'sepp' run."))
+
+    return pd_taxonomy.set_index('Feature ID')