qiime2 · sjanssen2 · Nov 23, 2017 · Nov 17, 2017 · Nov 17, 2017 · Nov 17, 2017
diff --git a/Example/insertion-taxonomy.qza b/Example/insertion-taxonomy.qza
diff --git a/README.md b/README.md
@@ -46,10 +46,9 @@ A fragment may be reasonable to insert into multiple locations. However, downstr
 
 ## Files produced
 
-The plugin will generate three files:
+The plugin will generate two files:
   1. A `Phylogeny[Rooted]` type: This is the tree with the sequences placed (those that could be inserted), which are identified by their corresponding sequence IDs. You can directly use this tree for phylogenetic diversity computation like UniFrac or Faith's Phylogenetic Diversity.
   2. A `Placements` type: It is a JSON object which, for every input sequence, describes the different possible placements.
-  3. And last a `FeatureData[Taxonomy]` type: This is a table that holds a taxonomic lineage string for every fragment inserted into the tree. The lineage is obtained by traversing the tree from the fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, taxonomy is only as good as provided reference phylogeny. Note, taxonomic labels are identified by containing two underscore characters `_` `_` as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this file, since it has not been compared to existing taxonomic assignment methods. Particularly since the default reference tree is not inline with the reference taxonomy.**
 
 ## Example
 
@@ -59,21 +58,36 @@ Let us use the `FeatureData[Sequence]` from QIIME's tutorial as our input:
 
    - `rep-seqs.qza`: [view](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2017.10%2Fdata%2Ftutorials%2Fmoving-pictures%2Frep-seqs.qza) | [download](https://docs.qiime2.org/2017.10/data/tutorials/moving-pictures/rep-seqs.qza)
 
-The following single command will produce three outputs: 1) `phylogeny.qza` is the `Phylogeny[Rooted]`, 2) `placements.qza` provides placement distributions for the fragments (you will most likely ignore this output) and 3) `classification.qza` which is a taxonomic classification for every fragment that has been inserted into the reference phylogeny and is of the type `FeatureData[Taxonomy]` (Computation might take some 10 minutes):
+The following single command will produce two outputs: 1) `insertion-tree.qza`, which is the `Phylogeny[Rooted]`, and 2) `insertion-placements.qza`, which provides placement distributions for the fragments (you will most likely ignore this output). Computation might take about 10 minutes:
 ```
 qiime fragment-insertion sepp-16s-greengenes \
   --i-representative-sequences rep-seqs.qza \
   --o-tree insertion-tree.qza \
-  --o-placements insertion-placements.qza \
-  --o-classification insertion-taxonomy.qza
+  --o-placements insertion-placements.qza
 ```
 Output artifacts:
    - `insertion-tree.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-tree.qza?raw=true)
    - `insertion-placements.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-placements.qza?raw=true)
-   - `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true)
 
 You can then use `insertion-tree.qza` for all downstream analyses, e.g. "Alpha and beta diversity analysis", instead of `rooted-tree.qza`.
 
+### Assign taxonomy
+
+The *fragment-insertion* plugin provides two alternative methods to assign a taxonomic lineage to every fragment. Assume the tips of your reference phylogeny are, e.g., OTU-IDs from Greengenes (which is the case when you use the default reference). If you have a taxonomic mapping from every OTU-ID to a lineage string, as provided by Greengenes, the function `classify-otus` will detect the closest OTU-IDs for every fragment in the insertion tree and report the lineage shared by those OTU-IDs (their longest common prefix, rank by rank) for the fragment. Thus, the function expects two required input artifacts: 1) the representative sequences of type `FeatureData[Sequence]` and 2) the tree resulting from a previous `sepp` run, which is of type `Phylogeny[Rooted]`. For the example, we also specify a third, optional input [taxonomy_gg99.qza](https://raw.githubusercontent.com/biocore/q2-fragment-insertion/master/taxonomy_gg99.gza) of type `FeatureData[Taxonomy]`.
+
+    qiime fragment-insertion classify-otus \
+      --i-representative-sequences rep-seqs.qza \
+      --i-tree insertion-tree.qza \
+      --i-reference-taxonomy taxonomy_gg99.qza \
+      --o-classification insertion-taxonomy.qza
+
+Output artifacts:
+   - `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true)
+
+Make sure that the `--i-reference-taxonomy` matches the reference phylogeny used with the function `sepp`.
+
+Alternatively, you can use the function `classify-paths` to obtain a taxonomy. The lineage strings are obtained by traversing the insertion tree from each fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, the taxonomy is only as good as the provided reference phylogeny. Note that taxonomic labels are identified by containing two consecutive underscore characters (`__`), as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this function, since it has not been compared to existing taxonomic assignment methods, particularly since the default reference tree is not in line with the reference taxonomy.**
+
 ### Import representative sequences into QIIME 2 artifact
 
 Assume you have a collection of representative sequences as a multiple fasta file, e.g. from downloading a `reference-hit.seqs.fa` Qiita file. You can *import* this file into a QIIME 2 artifact with the following command:
@@ -104,3 +118,11 @@ Upload the newly created conda package to biocore:
     anaconda upload -u biocore q2-fragment-insertion-0.1.0-py35h3e8d850_1.tar.bz2
 
 Remember to do that for both Linux and OSX.
+
+## How to import taxonomy tables
+
+    qiime tools import \
+      --input-path taxonomy.tsv \
+      --source-format HeaderlessTSVTaxonomyFormat \
+      --type "FeatureData[Taxonomy]" \
+      --output-path foo.qza
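
The import command above reads a `taxonomy.tsv` file. As a minimal sketch, assuming that `HeaderlessTSVTaxonomyFormat` expects one tab-separated `feature ID`, `lineage` pair per line and no header row, such a file could be written as follows. The helper name `write_headerless_taxonomy` is hypothetical; the two rows are copied from the test data added in this PR.

```python
import csv
import os
import tempfile

# Two example rows taken from taxonomy_real_data_small_otus.tsv in this PR.
rows = [
    ("testseqa",
     "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
     "f__Ruminococcaceae; g__Ruminococcus; s__"),
    ("testseqb",
     "k__Bacteria; p__Firmicutes; c__Bacilli; o__Gemellales; "
     "f__Gemellaceae; g__; s__"),
]


def write_headerless_taxonomy(path, rows):
    """Write (feature ID, lineage) pairs as tab-separated lines, no header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Write into a temporary directory and read the file back for inspection.
path = os.path.join(tempfile.mkdtemp(), "taxonomy.tsv")
write_headerless_taxonomy(path, rows)
with open(path) as handle:
    lines = handle.read().splitlines()
```

Each resulting line holds exactly two fields, so the file has the two-column shape assumed above.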
diff --git a/q2_fragment_insertion/__init__.py b/q2_fragment_insertion/__init__.py
@@ -7,8 +7,8 @@
 # ----------------------------------------------------------------------------
 import pkg_resources
 
-from ._insertion import sepp
+from ._insertion import sepp, classify_paths, classify_otus
 
 
 __version__ = pkg_resources.get_distribution('q2-fragment-insertion').version
-__all__ = ['sepp']
+__all__ = ['sepp', 'classify_paths', 'classify_otus']
diff --git a/q2_fragment_insertion/_insertion.py b/q2_fragment_insertion/_insertion.py
@@ -7,6 +7,7 @@
 # ----------------------------------------------------------------------------
 
 import os
+import sys
 import shutil
 import tempfile
 import subprocess
@@ -21,6 +22,7 @@
                                    AlignedDNASequencesDirectoryFormat,
                                    AlignedDNAIterator,
                                    AlignedDNAFASTAFormat)
+from qiime2.sdk import Artifact
 from q2_types.tree import NewickFormat
 
 from q2_fragment_insertion._format import PlacementsFormat
@@ -95,7 +97,7 @@ def _obtain_taxonomy(filename_tree: str,
                      DNASequencesDirectoryFormat) -> pd.DataFrame:
     """Bottom-up traverse the tree for nodes that are inserted fragments and
        collect taxonomic labels upon traversal."""
-    tree = skbio.TreeNode.read(filename_tree)
+    tree = skbio.TreeNode.read(str(filename_tree))
     taxonomy = []
     for fragment in representative_sequences.file.view(DNAIterator):
         lineage = []
@@ -108,7 +110,13 @@ def _obtain_taxonomy(filename_tree: str,
             lineage_str = np.nan
         taxonomy.append({'Feature ID': fragment.metadata['id'],
                          'Taxon': lineage_str})
-    return pd.DataFrame(taxonomy).set_index('Feature ID')
+    pd_taxonomy = pd.DataFrame(taxonomy).set_index('Feature ID')
+    if pd_taxonomy['Taxon'].dropna().shape[0] == 0:
+        raise ValueError(
+            ("None of the representative-sequences can be found in the "
+             "insertion tree. Please double check that both inputs match up, "
+             "i.e. are results from the same 'sepp' run."))
+    return pd_taxonomy
 
 
 def _sepp_path():
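
The bottom-up traversal that `_obtain_taxonomy` performs (and that `classify_paths` exposes) can be illustrated without scikit-bio. The following is only a sketch under stated assumptions: the tree is a plain `parent` dictionary, node labels live in a separate `labels` dictionary, labels are reported root-first, and `None` stands in for the `np.nan` used in the plugin. The function name `lineage_from_path` and the toy node names are hypothetical.

```python
def lineage_from_path(tip, parent, labels):
    """Walk from `tip` to the root and collect Greengenes-style labels.

    parent: dict mapping each node to its parent (the root maps to None)
    labels: dict mapping nodes to their name in the reference tree
    """
    collected = []
    node = tip
    while node is not None:
        name = labels.get(node)
        # Only names containing two underscores count as taxonomic labels.
        if name is not None and "__" in name:
            collected.append(name)
        node = parent.get(node)
    if not collected:
        return None
    # Labels were gathered tip-to-root; reverse them to start at the kingdom.
    return "; ".join(reversed(collected))


# Toy tree: n0 (root) -> n1 -> n2 -> n3 -> {testseqa, otu1}
parent = {"n0": None, "n1": "n0", "n2": "n1", "n3": "n2",
          "testseqa": "n3", "otu1": "n3"}
# "0.87" mimics a support value and is skipped because it lacks "__".
labels = {"n0": "k__Bacteria", "n1": "p__Firmicutes",
          "n2": "0.87", "n3": "c__Clostridia"}

lineage = lineage_from_path("testseqa", parent, labels)
print(lineage)  # k__Bacteria; p__Firmicutes; c__Clostridia
```

The sketch makes the limitation noted in the README visible: the result depends entirely on where the taxonomic labels sit in the reference phylogeny.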
@@ -135,7 +143,7 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
          threads: int=1,
          reference_alignment: AlignedDNASequencesDirectoryFormat=None,
          reference_phylogeny: NewickFormat=None
-         ) -> (NewickFormat, PlacementsFormat, pd.DataFrame):
+         ) -> (NewickFormat, PlacementsFormat):
 
     _sanity()
     # check if sequences and tips in reference match
@@ -150,7 +158,6 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
 
     placements_result = PlacementsFormat()
     tree_result = NewickFormat()
-    taxonomy = pd.DataFrame()
 
     with tempfile.TemporaryDirectory() as tmp:
         _run(str(representative_sequences.file.view(DNAFASTAFormat)),
@@ -160,9 +167,93 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
         outplacements = os.path.join(tmp, placements)
 
         _add_missing_branch_length(outtree)
-        taxonomy = _obtain_taxonomy(outtree, representative_sequences)
 
         shutil.copyfile(outtree, str(tree_result))
         shutil.copyfile(outplacements, str(placements_result))
 
-    return tree_result, placements_result, taxonomy
+    return tree_result, placements_result
+
+
+def classify_paths(representative_sequences: DNASequencesDirectoryFormat,
+                   tree: NewickFormat) -> pd.DataFrame:
+    return _obtain_taxonomy(str(tree), representative_sequences)
+
+
+def classify_otus(representative_sequences: DNASequencesDirectoryFormat,
+                  tree: NewickFormat,
+                  reference_taxonomy: pd.DataFrame=None) -> pd.DataFrame:
+    if reference_taxonomy is None:
+        filename_default_taxonomy = resource_filename(
+            Requirement.parse('q2_fragment_insertion'),
+            'q2_fragment_insertion/assets/taxonomy_gg99.qza')
+        reference_taxonomy = Artifact.load(
+            filename_default_taxonomy).view(pd.DataFrame)
+
+    # convert type of feature IDs to str (depending on pandas type inference
+    # they might come as integers), to make sure they are of the same type as
+    # in the tree.
+    reference_taxonomy.index = map(str, reference_taxonomy.index)
+
+    # load the insertion tree
+    tree = skbio.TreeNode.read(str(tree))
+
+    # ensure that all reference tips in the tree (those without the inserted
+    # fragments) have a mapping in the user provided taxonomy table
+    names_tips = set([node.name for node in tree.tips()])
+    names_fragments = set([fragment.metadata['id']
+                           for fragment
+                           in representative_sequences.file.view(DNAIterator)])
+    missing_features = (set(names_tips) - set(names_fragments)) -\
+        set(reference_taxonomy.index)
+    if len(missing_features) > 0:
+        # QIIME2 users can run with --verbose and see stderr and stdout.
+        # Thus, we here report more details about the mismatch:
+        sys.stderr.write(
+            ("The taxonomy artifact you provided does not contain lineage "
+             "information for the following %i features:\n%s") %
+            (len(missing_features), "\n".join(missing_features)))
+        raise ValueError("Not all OTUs in the provided insertion tree have "
+                         "mappings in the provided reference taxonomy.")
+
+    taxonomy = []
+    for fragment in representative_sequences.file.view(DNAIterator):
+        lineage_str = np.nan
+        try:
+            curr_node = tree.find(fragment.metadata['id'])
+        except skbio.tree.MissingNodeError:
+            curr_node = None
+        if curr_node is not None:
+            foundOTUs = []
+            while len(foundOTUs) == 0:
+                for node in curr_node.postorder():
+                    if (node.name is not None) and \
+                       (node.name in reference_taxonomy.index):
+                        foundOTUs.append(node.name)
+                if curr_node.is_root():
+                    break
+                curr_node = curr_node.parent
+            if len(foundOTUs) > 0:
+                split_lineages = []
+                for otu in foundOTUs:
+                    # find lineage string for OTU
+                    lineage = reference_taxonomy.loc[otu, 'Taxon']
+                    # necessary to split lineage apart to ensure that
+                    # the longest common prefix operates on atomic ranks
+                    # instead of characters
+                    split_lineages.append(list(
+                        map(str.strip, lineage.split(';'))))
+                # find the longest common prefix rank-wise and concatenate to
+                # one lineage string, separated by ;
+                lineage_str = "; ".join(os.path.commonprefix(split_lineages))
+            taxonomy.append({'Feature ID': fragment.metadata['id'],
+                             'Taxon': lineage_str})
+    pd_taxonomy = pd.DataFrame(taxonomy)
+    # test if dataframe is completely empty, or if no lineages could be found
+    if (len(taxonomy) == 0) or \
+       (pd_taxonomy['Taxon'].dropna().shape[0] == 0):
+        raise ValueError(
+            ("None of the representative-sequences can be found in the "
+             "insertion tree. Please double check that both inputs match up, "
+             "i.e. are results from the same 'sepp' run."))
+
+    return pd_taxonomy.set_index('Feature ID')
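
The core of `classify_otus` is easier to follow on a toy tree. The sketch below is a simplified illustration, not the plugin code: it uses plain dictionaries instead of `skbio.TreeNode`, checks only tips (the PR inspects every node of the subtree), returns `None` instead of `np.nan`, and computes the rank-wise common prefix with a small helper rather than `os.path.commonprefix`. All function and node names are hypothetical.

```python
def tips_below(node, children):
    """Return the names of all tips in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(tips_below(kid, children))
    return found


def common_rank_prefix(lineages):
    """Longest common prefix of several lineages, compared rank by rank."""
    prefix = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return prefix


def classify_by_closest_otus(fragment, parent, children, taxonomy):
    """Climb from the fragment until a subtree contains reference OTUs."""
    if fragment not in parent:
        return None  # fragment was not inserted into the tree
    node = fragment
    while node is not None:
        otus = [tip for tip in tips_below(node, children) if tip in taxonomy]
        if otus:
            split = [[rank.strip() for rank in taxonomy[otu].split(";")]
                     for otu in otus]
            return "; ".join(common_rank_prefix(split))
        node = parent[node]
    return None


# Toy tree: root -> {n2, n4}; n2 -> {frag1, n3}; n3 -> {otuA, otuB};
# n4 -> {frag2, otuC}
children = {"root": ["n2", "n4"], "n2": ["frag1", "n3"],
            "n3": ["otuA", "otuB"], "n4": ["frag2", "otuC"]}
parent = {"root": None, "n2": "root", "n4": "root", "frag1": "n2",
          "n3": "n2", "otuA": "n3", "otuB": "n3",
          "frag2": "n4", "otuC": "n4"}
taxonomy = {
    "otuA": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "otuB": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "otuC": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
}

result_frag1 = classify_by_closest_otus("frag1", parent, children, taxonomy)
result_frag2 = classify_by_closest_otus("frag2", parent, children, taxonomy)
```

Here `frag1` sits next to two OTUs that disagree at the class rank, so its lineage stops at the phylum, while `frag2` has a single closest OTU and inherits that OTU's full lineage. Splitting lineages into ranks before comparing them avoids character-level prefixes such as `c__` that a plain string comparison would produce.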
diff --git a/q2_fragment_insertion/plugin_setup.py b/q2_fragment_insertion/plugin_setup.py
@@ -39,19 +39,14 @@
 
 
 _output_descriptions = {
-    'tree': 'The tree with inserted feature data',
-    'classification':
-    ('Taxonomic lineages for fragments, obtained by traversing the insertion '
-     'tree bottom up and collecting taxonomic labels. Only works for '
-     'Greengenes lines labels, i.e. they need to contain "__" infixes.')}
+    'tree': 'The tree with inserted feature data'}
 
 
 _parameters = {'threads': qiime2.plugin.Int}
 
 
 _outputs = [('tree', Phylogeny[Rooted]),
-            ('placements', Placements),
-            ('classification', FeatureData[Taxonomy])]
+            ('placements', Placements)]
 
 
 plugin.methods.register_function(
@@ -79,4 +74,58 @@
 )
 
 
+plugin.methods.register_function(
+    function=q2fi.classify_paths,
+    inputs={'representative_sequences': FeatureData[Sequence],
+            'tree': Phylogeny[Rooted]},
+    input_descriptions={
+        'representative_sequences':
+        "The sequences used for a \'sepp\' run to produce the \'tree\'.",
+        'tree':
+        ('The tree resulting from inserting fragments into a reference '
+         'phylogeny, i.e. the output of function \'sepp\'')},
+    parameters={},
+    parameter_descriptions={},
+    outputs=[('classification', FeatureData[Taxonomy])],
+    output_descriptions={
+        'classification': 'Taxonomic lineages for inserted fragments.'},
+    name=('Obtain taxonomic lineages, by collecting taxonomic labels from '
+          'reference phylogeny.'),
+    description=(
+        ('Use the resulting tree from \'sepp\' and traverse it bottom-up to '
+         'obtain taxonomic lineages for every inserted fragment. Only works '
+         'for Greengenes lines labels, i.e. they need to contain "__" '
+         'infixes. Quality strongly depends on correct placements of taxonomic'
+         ' labels in the provided reference phylogeny.'))
+)
+
+
+plugin.methods.register_function(
+    function=q2fi.classify_otus,
+    inputs={'representative_sequences': FeatureData[Sequence],
+            'tree': Phylogeny[Rooted],
+            'reference_taxonomy': FeatureData[Taxonomy]},
+    input_descriptions={
+        'representative_sequences':
+        "The sequences used for a \'sepp\' run to produce the \'tree\'.",
+        'tree':
+        ('The tree resulting from inserting fragments into a reference '
+         'phylogeny, i.e. the output of function \'sepp\''),
+        'reference_taxonomy':
+        ("Reference taxonomic table that maps every OTU-ID into a taxonomic "
+         "lineage string.")},
+    parameters={},
+    parameter_descriptions={},
+    outputs=[('classification', FeatureData[Taxonomy])],
+    output_descriptions={
+        'classification': 'Taxonomic lineages for inserted fragments.'},
+    name=('Obtain taxonomic lineages, by finding closest OTU in reference '
+          'phylogeny.'),
+    description=(
+        'Use the resulting tree from \'sepp\' and find closest OTU-ID for '
+        'every inserted fragment. Then, look up the reference lineage string '
+        'in the reference taxonomy.')
+)
+
+
 importlib.import_module('q2_fragment_insertion._transformer')
diff --git a/q2_fragment_insertion/tests/data/sepp_tree_small.qza b/q2_fragment_insertion/tests/data/sepp_tree_small.qza
diff --git a/q2_fragment_insertion/tests/data/sepp_tree_tiny.qza b/q2_fragment_insertion/tests/data/sepp_tree_tiny.qza
diff --git a/q2_fragment_insertion/tests/data/taxonomy_missingotus.qza b/q2_fragment_insertion/tests/data/taxonomy_missingotus.qza
diff --git a/q2_fragment_insertion/tests/data/taxonomy_real_data_small_otus.tsv b/q2_fragment_insertion/tests/data/taxonomy_real_data_small_otus.tsv
@@ -0,0 +1,11 @@
+Feature ID	Taxon
+testseqa	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__
+testseqb	k__Bacteria; p__Firmicutes; c__Bacilli; o__Gemellales; f__Gemellaceae; g__; s__
+testseqc	k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__
+testseqd	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__
+testseqe	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Tissierellaceae]; g__Anaerococcus; s__
+testseqf	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__
+testseqg	k__Bacteria; p__Actinobacteria; c__Coriobacteriia; o__Coriobacteriales; f__Coriobacteriaceae; g__Adlercreutzia; s__
+testseqh	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Rikenellaceae; g__; s__
+testseqi	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
+testseqj	k__Bacteria; p__Planctomycetes; c__vadinHA49; o__DH61; f__; g__; s__
diff --git a/...n/tests/data/taxonomy_real_data_small.tsv → ...s/data/taxonomy_real_data_small_paths.tsv b/...n/tests/data/taxonomy_real_data_small.tsv → ...s/data/taxonomy_real_data_small_paths.tsv
diff --git a/q2_fragment_insertion/tests/data/taxonomy_real_data_tiny_otus.tsv b/q2_fragment_insertion/tests/data/taxonomy_real_data_tiny_otus.tsv
@@ -0,0 +1,11 @@
+Feature ID	Taxon
+testseqa	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium; s__
+testseqb	k__Bacteria; p__Firmicutes; c__Bacilli; o__Gemellales; f__Gemellaceae; g__Gemella; s__
+testseqc	k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__
+testseqd	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__
+testseqe	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Tissierellaceae]; g__Anaerococcus; s__
+testseqf	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__
+testseqg	k__Bacteria; p__Actinobacteria; c__Coriobacteriia; o__Coriobacteriales; f__Coriobacteriaceae; g__; s__
+testseqh	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
+testseqi	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__; s__
+testseqj	k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Desulfovibrionales; f__Desulfovibrionaceae; g__Lawsonia; s__
diff --git a/...on/tests/data/taxonomy_real_data_tiny.tsv → ...ts/data/taxonomy_real_data_tiny_paths.tsv b/...on/tests/data/taxonomy_real_data_tiny.tsv → ...ts/data/taxonomy_real_data_tiny_paths.tsv