Classify functions #19

Merged on Nov 23, 2017 (37 commits)
Changes from 29 commits
Commits
bba0775
added some test data trees
sjanssen2 Nov 17, 2017
b840e76
registered function classify_paths
sjanssen2 Nov 17, 2017
dda87e0
added new function to qiime interface
sjanssen2 Nov 17, 2017
cb2eab1
moving taxonomy from 'sepp' into its own function 'classify-paths'.
sjanssen2 Nov 17, 2017
5d1818a
adding test data
sjanssen2 Nov 18, 2017
cdb56a7
renaming
sjanssen2 Nov 18, 2017
efafa93
info on how to import taxonomies
sjanssen2 Nov 18, 2017
97e099e
register second classfiy function
sjanssen2 Nov 18, 2017
61692e2
package default taxonomy (Greengenes 99%) into plugin
sjanssen2 Nov 18, 2017
eb5a0b6
more tests
sjanssen2 Nov 18, 2017
86e680e
expose new function classify_otus
sjanssen2 Nov 18, 2017
d828111
coding classify_otus
sjanssen2 Nov 18, 2017
cd37517
computed with classify-otus not classify-paths
sjanssen2 Nov 18, 2017
9548403
added a section to explain the classify functions
sjanssen2 Nov 18, 2017
247ff53
added missing test file
sjanssen2 Nov 18, 2017
fabfc6c
remove sequence
sjanssen2 Nov 18, 2017
8b7d104
flake8
sjanssen2 Nov 18, 2017
9f5abde
Merge branch 'master' of https://github.com/biocore/q2-fragment-inser…
sjanssen2 Nov 18, 2017
f2548e7
Merge branch 'master' of https://github.com/biocore/q2-fragment-inser…
sjanssen2 Nov 20, 2017
aed0b8c
Merge branch 'master' of https://github.com/biocore/q2-fragment-inser…
sjanssen2 Nov 21, 2017
bb1979c
addressing Antonio's comments
sjanssen2 Nov 21, 2017
122d6e9
renamed
sjanssen2 Nov 21, 2017
8780207
renaming
sjanssen2 Nov 21, 2017
b413991
q
sjanssen2 Nov 21, 2017
4de7268
smaller try except block
sjanssen2 Nov 21, 2017
e633d74
moved set_index because it will fail for empty dataframes
sjanssen2 Nov 21, 2017
5b208eb
fix empty dataframe
sjanssen2 Nov 21, 2017
53d1eea
not using os.path.join for resources, since they are NOT files
sjanssen2 Nov 21, 2017
512782f
rather than just dumping error message to stderr, I now capture it an…
sjanssen2 Nov 21, 2017
abac278
some cool edits suggested by Daniel
sjanssen2 Nov 22, 2017
69b7589
removed description of "classify-path" from readme as we agreed it is…
sjanssen2 Nov 22, 2017
7b2cfa4
commented out function registration for "classify-paths", but I want …
sjanssen2 Nov 22, 2017
5c75825
commented out tests for classify-paths
sjanssen2 Nov 22, 2017
c5c5a25
added an "experimental" label to function classify-otus
sjanssen2 Nov 22, 2017
b985b4e
bump version number
sjanssen2 Nov 22, 2017
25f8808
added extensive comments about tree traversal and lineage computation
sjanssen2 Nov 22, 2017
0324829
removed readme comments
sjanssen2 Nov 23, 2017
Binary file modified Example/insertion-taxonomy.qza
Binary file not shown.
34 changes: 28 additions & 6 deletions README.md
@@ -46,10 +46,9 @@ A fragment may be reasonable to insert into multiple locations. However, downstr

## Files produced

The plugin will generate three files:
The plugin will generate two files:
1. A `Phylogeny[Rooted]` type: This is the tree with the sequences placed (those which could be inserted); they are identified by their corresponding sequence IDs. You can directly use this tree for phylogenetic diversity computation like UniFrac or Faith's Phylogenetic Diversity.
2. A `Placements` type: It is a JSON object which, for every input sequence, describes the different possible placements.
3. And last a `FeatureData[Taxonomy]` type: This is a table that holds a taxonomic lineage string for every fragment inserted into the tree. The lineage is obtained by traversing the tree from the fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, taxonomy is only as good as provided reference phylogeny. Note, taxonomic labels are identified by containing two underscore characters `_` `_` as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this file, since it has not been compared to existing taxonomic assignment methods. Particularly since the default reference tree is not inline with the reference taxonomy.**

## Example

@@ -59,21 +58,36 @@ Let us use the `FeatureData[Sequence]` from QIIME's tutorial as our input:

- `rep-seqs.qza`: [view](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2017.10%2Fdata%2Ftutorials%2Fmoving-pictures%2Frep-seqs.qza) | [download](https://docs.qiime2.org/2017.10/data/tutorials/moving-pictures/rep-seqs.qza)

The following single command will produce three outputs: 1) `phylogeny.qza` is the `Phylogeny[Rooted]`, 2) `placements.qza` provides placement distributions for the fragments (you will most likely ignore this output) and 3) `classification.qza` which is a taxonomic classification for every fragment that has been inserted into the reference phylogeny and is of the type `FeatureData[Taxonomy]` (Computation might take some 10 minutes):
The following single command will produce two outputs: 1) `phylogeny.qza` is the `Phylogeny[Rooted]` and 2) `placements.qza` provides placement distributions for the fragments (you will most likely ignore this output) (Computation might take some 10 minutes):
```
qiime fragment-insertion sepp-16s-greengenes \
--i-representative-sequences rep-seqs.qza \
--o-tree insertion-tree.qza \
--o-placements insertion-placements.qza \
--o-classification insertion-taxonomy.qza
--o-placements insertion-placements.qza
```
Output artifacts:
- `insertion-tree.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-tree.qza?raw=true)
- `insertion-placements.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-placements.qza?raw=true)
- `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true)

You can then use `insertion-tree.qza` for all downstream analyses, e.g. "Alpha and beta diversity analysis", instead of `rooted-tree.qza`.

### Assign taxonomy

The *fragment-insertion* plugin provides two alternative methods to assign a taxonomic lineage to every fragment. Assume the tips of your reference phylogeny are OTU-IDs, e.g. from Greengenes (which is the case when you use the default reference). If you have a mapping from every OTU-ID to a lineage string, as provided by Greengenes, the function `classify-otus` detects the closest OTU-IDs for every fragment in the insertion tree and reports their lineage string for the fragment; if several closest OTU-IDs are found, it reports the rank-wise longest common prefix of their lineage strings. The function expects two required input artifacts: 1) the representative sequences of type `FeatureData[Sequence]` and 2) the tree resulting from a previous `sepp` run, which is of type `Phylogeny[Rooted]`. For the example, we also specify a third, optional input [taxonomy_gg99.gza](https://raw.githubusercontent.com/biocore/q2-fragment-insertion/master/taxonomy_gg99.gza) of type `FeatureData[Taxonomy]`.
Member:
what is a .gza?

Collaborator Author:
a stupid typo which I cannot get out of my muscle memory


qiime fragment-insertion classify-otus \
--i-representative-sequences rep-seqs.qza \
--i-tree insertion-tree.qza \
--i-reference-taxonomy taxonomy_gg99.gza \
Member:
same

Collaborator Author:
fixed

--o-classification taxonomy.gza

Output artifacts:
- `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true)

Make sure that the `--i-reference-taxonomy` matches the reference phylogeny used with the function `sepp`.

Alternatively, you can use the function `classify-paths` to obtain a taxonomy. The lineage strings are obtained by traversing the insertion tree from each fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, the taxonomy is only as good as the provided reference phylogeny. Note that taxonomic labels are identified by containing two consecutive underscore characters (`__`), as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this function, since it has not been compared to existing taxonomic assignment methods. Particularly since the default reference tree is not inline with the reference taxonomy.**
Member:
If we do not recommend its use, can be remove it until it has been evaluated?

Member:
or more explicitly note the method as experimental? e.g., classify-paths-experimental? This point "Particularly since the default reference tree is not inline with the reference taxonomy." I think just serves to confuse a user, simply saying it's experimental should be fine

Collaborator Author:
I agree and I think the cleanest solution for now is to comment the code out and remove description from Readme.md, which I did. I don't want the code to get lost, therefore I did not remove according lines, but chose to comment them out.
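
For readers who want to see what `classify-paths` amounts to, the README text describes walking from a fragment tip towards the root and keeping every node label that contains `__`. The following is a minimal sketch of that idea, not the plugin's `_obtain_taxonomy` code. The `Node` class, the `lineage_from_path` helper, the toy tree, and the root-first ordering of the result are all assumptions made for illustration.

```python
class Node:
    """Tiny stand-in for a tree node (hypothetical, not skbio.TreeNode)."""

    def __init__(self, name=None, parent=None):
        self.name = name
        self.parent = parent


def lineage_from_path(tip):
    """Collect labels containing '__' on the path from a tip to the root."""
    labels = []
    node = tip.parent
    while node is not None:
        if node.name is not None and '__' in node.name:
            labels.append(node.name)
        node = node.parent
    # Labels were collected tip-to-root; report them root-first (assumption).
    return '; '.join(reversed(labels))


# Toy reference tree: two labeled ancestors and one unlabeled internal node.
root = Node('k__Bacteria')
phylum = Node('p__Firmicutes', parent=root)
unlabeled = Node(None, parent=phylum)
fragment = Node('fragment1', parent=unlabeled)

print(lineage_from_path(fragment))
```

The result depends entirely on where labels sit in the reference tree, which is the caveat raised in the README text under review.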


### Import representative sequences into QIIME 2 artifact

Assume you have a collection of representative sequences as a multiple fasta file, e.g. from downloading a `reference-hit.seqs.fa` Qiita file. You can *import* this file into a QIIME 2 artifact with the following command:
@@ -104,3 +118,11 @@ Upload the newly created conda package to biocore:
anaconda upload -u biocore q2-fragment-insertion-0.1.0-py35h3e8d850_1.tar.bz2

Remember to do that for both Linux and OSX.

## How to import taxonomy tables

qiime tools import \
--input-path taxonomy.tsv \
--source-format HeaderlessTSVTaxonomyFormat \
--type "FeatureData[Taxonomy]" \
--output-path foo.gza
Member:
?

Collaborator Author:
good catch. Was a note to myself. I added some explanatory text why one want to do that.
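
As a rough illustration of the kind of input the import command in this hunk would read, the sketch below writes a two-column, tab-separated file without a header row. The column layout (feature ID, then lineage string) is an assumption about `HeaderlessTSVTaxonomyFormat`, and the file location, the helper name `write_headerless_taxonomy`, and the rows are illustrative only.

```python
import csv
import os
import tempfile

# Illustrative rows: (feature ID, lineage string).
rows = [
    ('testseqa', 'k__Bacteria; p__Firmicutes; c__Clostridia'),
    ('testseqb', 'k__Bacteria; p__Firmicutes; c__Bacilli'),
]


def write_headerless_taxonomy(path, rows):
    """Write one 'ID<TAB>lineage' line per row, with no header line."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), 'taxonomy.tsv')
write_headerless_taxonomy(path, rows)

with open(path) as handle:
    print(handle.read(), end='')
```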

4 changes: 2 additions & 2 deletions q2_fragment_insertion/__init__.py
@@ -7,8 +7,8 @@
# ----------------------------------------------------------------------------
import pkg_resources

from ._insertion import sepp
from ._insertion import sepp, classify_paths, classify_otus


__version__ = pkg_resources.get_distribution('q2-fragment-insertion').version
__all__ = ['sepp']
__all__ = ['sepp', 'classify_paths', 'classify_otus']
103 changes: 97 additions & 6 deletions q2_fragment_insertion/_insertion.py
@@ -7,6 +7,7 @@
# ----------------------------------------------------------------------------

import os
import sys
import shutil
import tempfile
import subprocess
@@ -21,6 +22,7 @@
AlignedDNASequencesDirectoryFormat,
AlignedDNAIterator,
AlignedDNAFASTAFormat)
from qiime2.sdk import Artifact
from q2_types.tree import NewickFormat

from q2_fragment_insertion._format import PlacementsFormat
@@ -95,7 +97,7 @@ def _obtain_taxonomy(filename_tree: str,
DNASequencesDirectoryFormat) -> pd.DataFrame:
"""Bottom-up traversal of the tree for nodes that are inserted fragments,
collecting taxonomic labels upon traversal."""
tree = skbio.TreeNode.read(filename_tree)
tree = skbio.TreeNode.read(str(filename_tree))
taxonomy = []
for fragment in representative_sequences.file.view(DNAIterator):
lineage = []
@@ -108,7 +110,13 @@ def _obtain_taxonomy(filename_tree: str,
lineage_str = np.nan
taxonomy.append({'Feature ID': fragment.metadata['id'],
'Taxon': lineage_str})
return pd.DataFrame(taxonomy).set_index('Feature ID')
pd_taxonomy = pd.DataFrame(taxonomy).set_index('Feature ID')
if pd_taxonomy['Taxon'].dropna().shape[0] == 0:
raise ValueError(
("None of the representative-sequences can be found in the "
"insertion tree. Please double check that both inputs match up, "
"i.e. are results from the same 'sepp' run."))
return pd_taxonomy


def _sepp_path():
@@ -135,7 +143,7 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
threads: int=1,
reference_alignment: AlignedDNASequencesDirectoryFormat=None,
reference_phylogeny: NewickFormat=None
) -> (NewickFormat, PlacementsFormat, pd.DataFrame):
) -> (NewickFormat, PlacementsFormat):

_sanity()
# check if sequences and tips in reference match
@@ -150,7 +158,6 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,

placements_result = PlacementsFormat()
tree_result = NewickFormat()
taxonomy = pd.DataFrame()

with tempfile.TemporaryDirectory() as tmp:
_run(str(representative_sequences.file.view(DNAFASTAFormat)),
@@ -160,9 +167,93 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat,
outplacements = os.path.join(tmp, placements)

_add_missing_branch_length(outtree)
taxonomy = _obtain_taxonomy(outtree, representative_sequences)

shutil.copyfile(outtree, str(tree_result))
shutil.copyfile(outplacements, str(placements_result))

return tree_result, placements_result, taxonomy
return tree_result, placements_result


def classify_paths(representative_sequences: DNASequencesDirectoryFormat,
                   tree: NewickFormat) -> pd.DataFrame:
    return _obtain_taxonomy(str(tree), representative_sequences)


def classify_otus(representative_sequences: DNASequencesDirectoryFormat,
                  tree: NewickFormat,
                  reference_taxonomy: pd.DataFrame=None) -> pd.DataFrame:
    if reference_taxonomy is None:
        filename_default_taxonomy = resource_filename(
            Requirement.parse('q2_fragment_insertion'),
            'q2_fragment_insertion/assets/taxonomy_gg99.qza')
        reference_taxonomy = Artifact.load(
            filename_default_taxonomy).view(pd.DataFrame)

    # convert type of feature IDs to str (depending on pandas type inference
    # they might come as integers), to make sure they are of the same type as
    # in the tree.
    reference_taxonomy.index = map(str, reference_taxonomy.index)
Member:
Should this raise a nice error if they don't match? Also, could you expand/explain a bit more what's the expectation in the comments? Thanks
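
To make the expectation behind the `map(str, ...)` conversion concrete, here is a small sketch of the mismatch it guards against: tip names read from a Newick file are strings, while a taxonomy table may carry the same IDs as integers. The IDs below are made up for illustration and plain sets stand in for the pandas index.

```python
# Hypothetical feature IDs: integers as a table parser might infer them,
# strings as they would appear as tip names in the tree.
taxonomy_ids = [101, 102]
tip_names = {'101', '102'}

# Without conversion, no tip name is found in the taxonomy IDs.
missing_before = tip_names - set(taxonomy_ids)

# After converting the IDs to str, every tip has a match.
missing_after = tip_names - set(map(str, taxonomy_ids))

print(sorted(missing_before))
print(sorted(missing_after))
```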


    # load the insertion tree
    tree = skbio.TreeNode.read(str(tree))
Member:
Should it be better to use to_newick or something like that than converting to str, here you are assuming some specific behavior that might change, right?

Collaborator Author:
tree is a QIIME2 type NewickFormat, which is inherited from model.TextFileFormat, thus it is some kind of a file and with str() I obtain the file name.
I don't see definition of "to_newick" in this type.

Member:
Got it, thanks. I thought it was something different, perhaps adding docstrings will help with this kind of possible confusions.


    # ensure that all reference tips in the tree (those without the inserted
    # fragments) have a mapping in the user provided taxonomy table
    names_tips = set([node.name for node in tree.tips()])
Member:
Set comprehension? {x for x in thing}

Collaborator Author:
cool, wasn't aware of this. Thanks!

    names_fragments = set([fragment.metadata['id']
                           for fragment
                           in representative_sequences.file.view(DNAIterator)])
    missing_features = (set(names_tips) - set(names_fragments)) -\
Member:
Extraneous casts

Collaborator Author:
fixed

        set(reference_taxonomy.index)
    if len(missing_features) > 0:
        # QIIME2 users can run with --verbose and see stderr and stdout.
        # Thus, we here report more details about the mismatch:
        sys.stderr.write(
            ("The taxonomy artifact you provided does not contain lineage "
             "information for the following %i features:\n%s") %
            (len(missing_features), "\n".join(missing_features)))
        raise ValueError("Not all OTUs in the provided insertion tree have "
                         "mappings in the provided reference taxonomy.")
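
The coverage check above boils down to two set differences: take all tips, drop the inserted fragments, and see which of the remaining reference tips have no lineage. A minimal sketch with plain sets follows; the helper name `missing_reference_tips` and the toy IDs are hypothetical.

```python
def missing_reference_tips(tip_names, fragment_ids, taxonomy_ids):
    """Reference tips (tips that are not fragments) without a lineage."""
    return (set(tip_names) - set(fragment_ids)) - set(taxonomy_ids)


tips = {'otu1', 'otu2', 'otu3', 'frag1'}   # all tips of the insertion tree
fragments = {'frag1', 'frag2'}             # frag2 was not inserted
taxonomy = {'otu1', 'otu2'}                # IDs with a lineage string

missing = missing_reference_tips(tips, fragments, taxonomy)
print(sorted(missing))
```

Note that a fragment absent from the tree (`frag2` here) does not show up in this check; it only concerns reference tips.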

    taxonomy = []
    for fragment in representative_sequences.file.view(DNAIterator):
        lineage_str = np.nan
        try:
Member:
This is a quite long try/except block, is it possible to narrow what it covers?

Collaborator Author:
It covers if line 212 cannot find fragment fragment.metadata['id'] in the tree object. I don't know if my code is cleaner if I have a small try except block around that line and than enclose the lengthy block with an additional if condition?

Member:
That's what I thought and I wasn't sure what was more pythonic. Anyway, found this and I think it makes sense: https://stackoverflow.com/a/1835844. So perhaps should do the trick:

try:
   curr_node = tree.find(fragment.metadata['id'])
   foundOTUs = []
except:
   foundOTUs = [False]

BTW, just realized if len(foundOTUs) > 0: is always true once the while is done.

Collaborator Author:
foundOTUs should be the empty list if the whole tree has been traversed without finding any suitable OTU. The while loop can be exited via break if root has been reached

Collaborator Author:
I figure if you are confused by the long try except block, I should shorten it. Now it surrounds just one line and I am using an additional if statement

Member:
Just realized what was my confusion, thanks for your patience. The condition is the break. What about moving the code in the if len(foundOTUs) > 0: to the if curr_node.is_root(): and then break.

Collaborator Author:
sounds possible, but I would prefer to leave as is, since I find that more understandable. But I can change if you strongly disagree. Let me know.

Collaborator Author:
Scrap my last statement! That would mean we always traverse all the way up to the root and collect ALL OTU labels in the three. Later we would find the longest common prefix if ALL labels, which is "" and thus would render this function useless. We cannot change as you suggested.
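
The point about the longest common prefix can be checked with a short sketch. It applies `os.path.commonprefix` to lineages split into ranks, as the diff does, and contrasts that with a character-wise prefix. The helper name `common_lineage` and the example lineages are illustrative.

```python
import os

lineages = [
    'k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales',
    'k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales',
]


def common_lineage(lineages):
    """Rank-wise longest common prefix of ';'-separated lineage strings."""
    split = [[rank.strip() for rank in lineage.split(';')]
             for lineage in lineages]
    # commonprefix compares element by element, so lists of ranks work too.
    return '; '.join(os.path.commonprefix(split))


# Rank-wise: stops cleanly at the last shared rank.
print(common_lineage(lineages))

# Character-wise on the raw strings: leaves a dangling 'c__'.
print(os.path.commonprefix(lineages))

# Collecting labels from the whole tree tends towards an empty prefix.
print(repr(common_lineage(lineages + ['k__Archaea; p__Euryarchaeota'])))
```

This is why the traversal has to stop at the first subtree that contains an annotated OTU: the more lineages are collected, the shorter the shared prefix becomes.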

Member:
Should the body of the while be decomposed? It could help interpretation and reduce complexity

Collaborator Author:
I am lost. As the writer of this piece of code, I find it very readable, but obviously that is not true for anyone else :-/
I am open for concrete suggestions on how to change the code, but I don't know which direction I should take here.
It is a rather complex concept on how to traverse the tree and there are some edge cases to consider.
I now added some 30 lines of comments to explain my code. Does that help?

            curr_node = tree.find(fragment.metadata['id'])
        except skbio.tree.MissingNodeError:
            curr_node = None
Member:
Why not just continue?

Collaborator Author:
good idea
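
The pattern discussed in this thread, a `try` that wraps only the lookup plus `continue` for the failure case, might look like the sketch below. It is a generic illustration, not the PR's code: `MissingNodeError` and `find` are local stand-ins for the skbio names, and a dict stands in for the tree.

```python
class MissingNodeError(Exception):
    """Local stand-in for skbio.tree.MissingNodeError (hypothetical)."""


def find(tree, name):
    """Hypothetical lookup that mimics a tree's find() on a plain dict."""
    try:
        return tree[name]
    except KeyError:
        raise MissingNodeError(name)


def lookup_fragments(fragment_ids, tree):
    """Return found nodes; fragments missing from the tree are skipped."""
    results = {}
    for fragment_id in fragment_ids:
        try:
            node = find(tree, fragment_id)  # the only call expected to raise
        except MissingNodeError:
            continue                        # nothing more to do for this one
        results[fragment_id] = node         # normal path stays unindented
    return results


print(lookup_fragments(['frag1', 'frag2'], {'frag1': 'node-for-frag1'}))
```

One thing to weigh with `continue`: a skipped fragment produces no entry at all, so if a row with a missing value is wanted for such fragments, it has to be recorded before continuing.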

        if curr_node is not None:
            foundOTUs = []
            while len(foundOTUs) == 0:
                for node in curr_node.postorder():
Member:
The repeated calls to traverse subtrees is quite expensive, why not just walk .ancestors?

Collaborator Author:
due to the topology imposed by the fragment insertion we cannot rely the relationship between the found node and the closest annotated OTU tip. Therefore, we need to really investigate the full sub-tree. Unfortunately, this is expensive, but I don't see a shortcut.

Keep in mind that we do a bottom up traversal and most fragments should be inserted very close to the next OTU tip, thus I think on average this procedure is not that bad.

Member:
That could lead to a spectacularly bad assignment though for reads from candidate phyla that get inserted deep into the tree, right? Safest assignment is ancestors, suggest only investigating cousins and descendents if there is a rational branch length threshold in place which i think will be hard to fit well

Collaborator Author:
Consider the following example:
[image: ex]
How would you limit the search in the tree for fragment "ins1_toD+E" ? Strategy of "classify-otus" is to look down to the tips, not up to the ancestors. That is the strategy of "classify-paths".
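
The strategy described in this thread (look down to the tips of a growing subtree, rather than only up to the ancestors) can be sketched without skbio. This is a simplified illustration and not the PR's code: `Node`, `closest_otus`, and the toy tree are hypothetical, the tree is not the one from the image in the thread, and only tips are inspected whereas the diff iterates over all nodes via `postorder()`.

```python
class Node:
    """Minimal tree node with parent links (stand-in for skbio.TreeNode)."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def tips(self):
        """Yield all tips below (or at) this node, left to right."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.tips()


def closest_otus(fragment, otu_ids):
    """Walk up from the fragment until the subtree contains known OTUs."""
    node = fragment
    while node is not None:
        found = [tip.name for tip in node.tips() if tip.name in otu_ids]
        if found:
            return found
        node = node.parent
    return []  # reached the root without finding any annotated OTU


frag_a = Node('frag_a')
ins1 = Node('ins1')
root = Node(children=[
    Node(children=[frag_a, Node('A')]),
    Node(children=[ins1, Node(children=[Node('D'), Node('E')])]),
])
otu_ids = {'A', 'D', 'E'}

print(closest_otus(frag_a, otu_ids))
print(closest_otus(ins1, otu_ids))
```

When several OTUs are found at once, as for `ins1` in this toy tree, the diff reduces their lineages to a rank-wise common prefix. Each step up re-scans the enlarged subtree, which is the cost raised earlier in the thread.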

                    if (node.name is not None) and \
                       (node.name in reference_taxonomy.index):
                        foundOTUs.append(node.name)
                if curr_node.is_root():
                    break
                curr_node = curr_node.parent
            if len(foundOTUs) > 0:
                split_lineages = []
                for otu in foundOTUs:
                    # find lineage string for OTU
                    lineage = reference_taxonomy.loc[otu, 'Taxon']
                    # necessary to split lineage apart to ensure that
                    # the longest common prefix operates on atomic ranks
                    # instead of characters
                    split_lineages.append(list(
                        map(str.strip, lineage.split(';'))))
                # find the longest common prefix rank-wise and concatenate to
                # one lineage string, separated by ;
                lineage_str = "; ".join(os.path.commonprefix(split_lineages))
        taxonomy.append({'Feature ID': fragment.metadata['id'],
                         'Taxon': lineage_str})
    pd_taxonomy = pd.DataFrame(taxonomy)
    # test if dataframe is completely empty, or if no lineages could be found
    if (len(taxonomy) == 0) or \
       (pd_taxonomy['Taxon'].dropna().shape[0] == 0):
        raise ValueError(
            ("None of the representative-sequences can be found in the "
             "insertion tree. Please double check that both inputs match up, "
             "i.e. are results from the same 'sepp' run."))

    return pd_taxonomy.set_index('Feature ID')
63 changes: 56 additions & 7 deletions q2_fragment_insertion/plugin_setup.py
@@ -39,19 +39,14 @@


_output_descriptions = {
'tree': 'The tree with inserted feature data',
'classification':
('Taxonomic lineages for fragments, obtained by traversing the insertion '
'tree bottom up and collecting taxonomic labels. Only works for '
'Greengenes lines labels, i.e. they need to contain "__" infixes.')}
'tree': 'The tree with inserted feature data'}


_parameters = {'threads': qiime2.plugin.Int}


_outputs = [('tree', Phylogeny[Rooted]),
('placements', Placements),
('classification', FeatureData[Taxonomy])]
('placements', Placements)]


plugin.methods.register_function(
@@ -79,4 +74,58 @@
)


plugin.methods.register_function(
    function=q2fi.classify_paths,
    inputs={'representative_sequences': FeatureData[Sequence],
            'tree': Phylogeny[Rooted]},
    input_descriptions={
        'representative_sequences':
        "The sequences used for a \'sepp\' run to produce the \'tree\'.",
        'tree':
        ('The tree resulting from inserting fragments into a reference '
         'phylogeny, i.e. the output of function \'sepp\'')},
    parameters={},
    parameter_descriptions={},
    outputs=[('classification', FeatureData[Taxonomy])],
    output_descriptions={
        'classification': 'Taxonomic lineages for inserted fragments.'},
    name=('Obtain taxonomic lineages, by collecting taxonomic labels from '
          'reference phylogeny.'),
    description=(
        ('Use the resulting tree from \'sepp\' and traverse it bottom-up to '
         'obtain taxonomic lineages for every inserted fragment. Only works '
         'for Greengenes lines labels, i.e. they need to contain "__" '
         'infixes. Quality strongly depends on correct placements of taxonomic'
         ' labels in the provided reference phylogeny.'))
)


plugin.methods.register_function(
    function=q2fi.classify_otus,
    inputs={'representative_sequences': FeatureData[Sequence],
            'tree': Phylogeny[Rooted],
            'reference_taxonomy': FeatureData[Taxonomy]},
    input_descriptions={
        'representative_sequences':
        "The sequences used for a \'sepp\' run to produce the \'tree\'.",
        'tree':
        ('The tree resulting from inserting fragments into a reference '
         'phylogeny, i.e. the output of function \'sepp\''),
        'reference_taxonomy':
        ("Reference taxonomic table that maps every OTU-ID into a taxonomic "
         "lineage string.")},
    parameters={},
    parameter_descriptions={},
    outputs=[('classification', FeatureData[Taxonomy])],
    output_descriptions={
        'classification': 'Taxonomic lineages for inserted fragments.'},
    name=('Obtain taxonomic lineages, by finding closest OTU in reference '
          'phylogeny.'),
    description=(
        'Use the resulting tree from \'sepp\' and find closest OTU-ID for '
        'every inserted fragment. Then, look up the reference lineage string '
        'in the reference taxonomy.')
)


importlib.import_module('q2_fragment_insertion._transformer')
Binary file not shown.
Binary file not shown.
Binary file not shown.
11 changes: 11 additions & 0 deletions q2_fragment_insertion/tests/data/taxonomy_real_data_small_otus.tsv
@@ -0,0 +1,11 @@
Feature ID Taxon
testseqa k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__
testseqb k__Bacteria; p__Firmicutes; c__Bacilli; o__Gemellales; f__Gemellaceae; g__; s__
testseqc k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__
testseqd k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__
testseqe k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Tissierellaceae]; g__Anaerococcus; s__
testseqf k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__
testseqg k__Bacteria; p__Actinobacteria; c__Coriobacteriia; o__Coriobacteriales; f__Coriobacteriaceae; g__Adlercreutzia; s__
testseqh k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Rikenellaceae; g__; s__
testseqi k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
testseqj k__Bacteria; p__Planctomycetes; c__vadinHA49; o__DH61; f__; g__; s__
11 changes: 11 additions & 0 deletions q2_fragment_insertion/tests/data/taxonomy_real_data_tiny_otus.tsv
@@ -0,0 +1,11 @@
Feature ID Taxon
testseqa k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium; s__
testseqb k__Bacteria; p__Firmicutes; c__Bacilli; o__Gemellales; f__Gemellaceae; g__Gemella; s__
testseqc k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__
testseqd k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__
testseqe k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Tissierellaceae]; g__Anaerococcus; s__
testseqf k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__
testseqg k__Bacteria; p__Actinobacteria; c__Coriobacteriia; o__Coriobacteriales; f__Coriobacteriaceae; g__; s__
testseqh k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
testseqi k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__; s__
testseqj k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Desulfovibrionales; f__Desulfovibrionaceae; g__Lawsonia; s__