Classify functions #19

sjanssen2 · 2017-11-18T00:51:23Z

Addressing #18, I created two QIIME 2 functions, "classify-otus" and "classify-paths", which implement two alternative strategies for assigning taxonomic labels to inserted fragments.

sjanssen2 · 2017-11-20T21:07:05Z

I split the functionality of the plugin into "sepp" and "classify-x", to enable providing alternatives for "x", see issue #19
@wasade @antgonza could you also review this one please?

antgonza

Forgot to submit review, sorry.

antgonza · 2017-11-20T21:15:27Z

q2_fragment_insertion/_insertion.py

+            filename_default_taxonomy).view(pd.DataFrame)
+
+    # ensure feature IDs are strings to match IDs from the tree
+    reference_taxonomy.index = map(str, reference_taxonomy.index)


Should this raise a nice error if they don't match? Also, could you expand/explain a bit more what's the expectation in the comments? Thanks
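
One way to act on this suggestion, sketched in plain Python rather than with the plugin's pandas objects: cast the taxonomy IDs to strings and fail early with a message that names the tips without a lineage. The function name `check_taxonomy_covers_tips` and the example IDs are hypothetical and not part of the PR.

```python
def check_taxonomy_covers_tips(tip_ids, taxonomy):
    """Return the taxonomy re-keyed by str, or raise if a tip has no lineage.

    tip_ids:  iterable of tip names from the tree (strings)
    taxonomy: dict mapping feature ID (possibly int) -> lineage string
    """
    # IDs read from a table may have been parsed as integers, while tree tip
    # names are strings, so compare both sides as strings.
    taxonomy_by_str = {str(key): lineage for key, lineage in taxonomy.items()}
    missing = sorted({str(tip) for tip in tip_ids} - set(taxonomy_by_str))
    if missing:
        raise ValueError("No lineage found for %i tip(s): %s"
                         % (len(missing), ", ".join(missing)))
    return taxonomy_by_str


# Hypothetical example data: integer IDs in the table, string IDs in the tree.
taxonomy = {4321: "k__Bacteria; p__Firmicutes", 99: "k__Archaea"}
checked = check_taxonomy_covers_tips(["4321", "99"], taxonomy)
```

Stating the expectation in the error message itself (every tip needs a lineage, IDs are compared as strings) also documents the comment's intent.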

antgonza · 2017-11-20T21:18:03Z

q2_fragment_insertion/_insertion.py

+    reference_taxonomy.index = map(str, reference_taxonomy.index)
+
+    # load the insertion tree
+    tree = skbio.TreeNode.read(str(tree))


Would it be better to use to_newick or something like that rather than converting to str? Here you are assuming some specific behavior that might change, right?

tree is of the QIIME 2 type NewickFormat, which inherits from model.TextFileFormat. It is thus a kind of file, and str() gives me the file name.
I don't see a definition of "to_newick" for this type.

Got it, thanks. I thought it was something different; perhaps adding docstrings would help avoid this kind of confusion.

antgonza · 2017-11-20T21:19:11Z

q2_fragment_insertion/_insertion.py

+    names_fragments = set([fragment.metadata['id']
+                           for fragment
+                           in representative_sequences.file.view(DNAIterator)])
+    if len((set(names_tips) - set(names_fragments)) -


Obviously, this is a nice and concise error, but in some cases, perhaps when verbose is on, it might be good to display all the discrepancies.

good point. How do I check if verbose is set?

Not sure, I think this is a good question for the Q2 forum ...

Got help on Slack and am now reporting on stderr which feature IDs are missing. The user can see that if verbose is set.
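
The approach described here (full list on stderr, short exception) could look roughly like the following sketch. The helper name `report_missing` and its `stream` parameter are assumptions made for illustration. The error text follows the wording quoted elsewhere in this thread.

```python
import sys


def report_missing(reference_tips, taxonomy_ids, stream=sys.stderr):
    """Write every unmapped tip ID to `stream`, then raise a concise error."""
    missing = sorted(set(reference_tips) - set(taxonomy_ids))
    if missing:
        # The detailed list goes to stderr, the exception itself stays short.
        stream.write(
            "The taxonomy has no entry for the following %i features:\n%s\n"
            % (len(missing), "\n".join(missing)))
        raise ValueError("Not all OTUs in the provided insertion tree have "
                         "mappings in the provided reference taxonomy.")
```

Passing the stream as a parameter is one way to let a unit test capture the message, for example with an `io.StringIO` buffer.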

antgonza · 2017-11-20T21:20:27Z

q2_fragment_insertion/_insertion.py

+    taxonomy = []
+    for fragment in representative_sequences.file.view(DNAIterator):
+        lineage_str = np.nan
+        try:


This is quite a long try/except block. Is it possible to narrow what it covers?

It covers the case where line 212 cannot find fragment.metadata['id'] in the tree object. I don't know whether my code would be cleaner if I put a small try/except block around that line and then enclosed the lengthy block in an additional if condition.

That's what I thought and I wasn't sure what was more pythonic. Anyway, found this and I think it makes sense: https://stackoverflow.com/a/1835844. So perhaps this should do the trick:

    try:
        curr_node = tree.find(fragment.metadata['id'])
        foundOTUs = []
    except:
        foundOTUs = [False]

BTW, just realized if len(foundOTUs) > 0: is always true once the while is done.

foundOTUs should be the empty list if the whole tree has been traversed without finding any suitable OTU. The while loop can be exited via break if root has been reached

I figure if you are confused by the long try except block, I should shorten it. Now it surrounds just one line and I am using an additional if statement

Just realized what my confusion was, thanks for your patience. The condition is the break. What about moving the code in the if len(foundOTUs) > 0: to the if curr_node.is_root(): and then break?

sounds possible, but I would prefer to leave as is, since I find that more understandable. But I can change if you strongly disagree. Let me know.

Scrap my last statement! That would mean we always traverse all the way up to the root and collect ALL OTU labels in the tree. Later we would find the longest common prefix of ALL labels, which is "" and would thus render this function useless. We cannot change it as you suggested.

Should the body of the while be decomposed? It could help interpretation and reduce complexity

I am lost. As the writer of this piece of code, I find it very readable, but obviously that is not true for anyone else :-/
I am open to concrete suggestions on how to change the code, but I don't know which direction I should take here.
It is a rather complex concept on how to traverse the tree and there are some edge cases to consider.
I now added some 30 lines of comments to explain my code. Does that help?
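
For readers of this thread, the loop under discussion can be sketched in a few lines: climb from the fragment tip and stop at the first ancestor whose subtree contains reference OTUs, or at the root. The `Node` class and `closest_otus` function below are a simplified stand-in written for illustration. They are neither the scikit-bio API nor the code in this PR.

```python
class Node:
    """Minimal stand-in for a tree node (not the skbio.TreeNode API)."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def tips(self):
        """Yield all leaves below (or equal to) this node."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.tips()


def closest_otus(fragment_tip, otu_ids):
    """Climb from the fragment tip; stop at the first subtree holding OTUs."""
    curr = fragment_tip
    while True:
        found = [tip.name for tip in curr.tips() if tip.name in otu_ids]
        if found or curr.parent is None:
            # Either OTUs were found, or the root was reached with none.
            return found
        curr = curr.parent


# ((frag1, otuA), (otuB, otuC)) with otuA, otuB, otuC as reference tips
frag = Node("frag1")
root = Node(children=[Node(children=[frag, Node("otuA")]),
                      Node(children=[Node("otuB"), Node("otuC")])])
```

Stopping at the first non-empty result is what keeps the later common-prefix step meaningful. Collecting labels all the way to the root would, as noted above, usually leave an empty common prefix.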

sjanssen2 · 2017-11-21T23:08:28Z

@wasade could you provide a second review?

sjanssen2 · 2017-11-22T00:39:03Z

Hey @qiyunzhu, do you have some capacity to review? Thanks!

wasade

Sorry if comments are short, on phone

wasade · 2017-11-21T05:45:39Z

README.md


 You can then use `insertion-tree.qza` for all downstream analyses, e.g. "Alpha and beta diversity analysis", instead of `rooted-tree.qza`.

+### Assign taxonomy
+
+The *fragment-insertion* plugin provides two alternative methods to assign a taxonomic lineage to every fragment. Assume the tips of your reference phylogeny are e.g. OTU-IDs from Greengenes (which is the case when you use the default reference). If you have a taxonomic mapping for every OTU-ID to a lineage string, as provided by Greengenes, function `classify-otus` will detect the closest OTU-IDs for every fragment in the insertion tree and report this OTU-IDs lineage string for the fragment. Thus, the function expects two required input artifacts: 1) the representative-sequences of type `FeatureData[Sequence]` and 2) the resulting tree of a previous `sepp` run which is of type `Phylogeny[Rooted]`. For the example, we also specify a third, optional input [taxonomy_gg99.gza](https://raw.githubusercontent.com/biocore/q2-fragment-insertion/master/taxonomy_gg99.gza) of type `FeatureData[Taxonomy]`.


what is a .gza?

a stupid typo which I cannot get out of my muscle memory

wasade · 2017-11-21T05:45:46Z

README.md

+    qiime fragment-insertion classify-otus \
+      --i-representative-sequences rep-seqs.qza \
+      --i-tree insertion-tree.qza \
+      --i-reference-taxonomy taxonomy_gg99.gza \


wasade · 2017-11-21T05:46:36Z

README.md

+
+You need to make sure, that the `--i-reference-taxonomy` matches the reference phylogeny used with function `sepp`.
+
+Alternatively, you can use the function `classify-paths` to a taxonomy. The lineage strings are obtained by traversing the insertion tree from each fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, taxonomy is only as good as provided reference phylogeny. Note, taxonomic labels are identified by containing two underscore characters `_` `_` as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this function, since it has not been compared to existing taxonomic assignment methods. Particularly since the default reference tree is not inline with the reference taxonomy.**


If we do not recommend its use, can we remove it until it has been evaluated?

or more explicitly note the method as experimental? e.g., classify-paths-experimental? This point "Particularly since the default reference tree is not inline with the reference taxonomy." I think just serves to confuse a user, simply saying it's experimental should be fine

I agree, and I think the cleanest solution for now is to comment the code out and remove the description from README.md, which I did. I don't want the code to get lost, so I commented the corresponding lines out instead of removing them.

wasade · 2017-11-21T05:48:02Z

README.md

+    --input-path taxonomy.tsv \
+    --source-format HeaderlessTSVTaxonomyFormat \
+    --type "FeatureData[Taxonomy]" \
+    --output-path foo.gza


Good catch. It was a note to myself. I added some explanatory text on why one would want to do that.

wasade · 2017-11-22T01:40:09Z

q2_fragment_insertion/_insertion.py

+
+    # ensure that all reference tips in the tree (those without the inserted
+    # fragments) have a mapping in the user provided taxonomy table
+    names_tips = set([node.name for node in tree.tips()])


Set comprehension? {x for x in thing}

cool, wasn't aware of this. Thanks!

wasade · 2017-11-22T01:40:48Z

q2_fragment_insertion/_insertion.py

+    names_fragments = set([fragment.metadata['id']
+                           for fragment
+                           in representative_sequences.file.view(DNAIterator)])
+    missing_features = (set(names_tips) - set(names_fragments)) -\


Extraneous casts
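
A small illustration of this and the previous suggestion, using made-up names in place of the tree tips and the `DNAIterator` view: once both collections are built with set comprehensions, the difference needs no further `set(...)` calls.

```python
# Hypothetical stand-ins for the tip names and the fragment IDs.
tip_names = ["otuA", "otuB", "frag1", "otuA"]
fragment_ids = ["frag1"]

# set([...]) builds a throwaway list first; a set comprehension does not.
names_tips = {name for name in tip_names}
names_fragments = {name for name in fragment_ids}

# Both operands are already sets, so wrapping them in set(...) is redundant.
reference_only = names_tips - names_fragments
```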

wasade · 2017-11-22T01:42:07Z

q2_fragment_insertion/_insertion.py

+        try:
+            curr_node = tree.find(fragment.metadata['id'])
+        except skbio.tree.MissingNodeError:
+            curr_node = None


Why not just continue?
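
A sketch of what `continue` would look like here, with a dictionary lookup standing in for `tree.find`. The `MissingNodeError` class and `find` helper below are local stand-ins defined only for this example.

```python
class MissingNodeError(KeyError):
    """Local stand-in for skbio.tree.MissingNodeError."""


def find(tree_index, name):
    """Look up a node by name, mimicking a failing tree.find()."""
    try:
        return tree_index[name]
    except KeyError:
        raise MissingNodeError(name)


tree_index = {"frag1": "node-1", "frag3": "node-3"}
visited = []
for fragment_id in ["frag1", "frag2", "frag3"]:
    try:
        curr_node = find(tree_index, fragment_id)
    except MissingNodeError:
        continue  # nothing to classify for this fragment, skip to the next
    visited.append(curr_node)
```

This removes the `curr_node = None` sentinel and one level of indentation. Whether it fits depends on whether fragments missing from the tree should still get a row in the result: the diff initialises `lineage_str = np.nan`, which suggests they might, in which case that row would have to be appended before the `continue`.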

wasade · 2017-11-22T01:43:41Z

q2_fragment_insertion/_insertion.py

+        if curr_node is not None:
+            foundOTUs = []
+            while len(foundOTUs) == 0:
+                for node in curr_node.postorder():


The repeated calls to traverse subtrees is quite expensive, why not just walk .ancestors?

Due to the topology imposed by the fragment insertion, we cannot rely on the relationship between the found node and the closest annotated OTU tip. Therefore, we really need to investigate the full subtree. Unfortunately, this is expensive, but I don't see a shortcut.

Keep in mind that we do a bottom up traversal and most fragments should be inserted very close to the next OTU tip, thus I think on average this procedure is not that bad.

That could lead to a spectacularly bad assignment though for reads from candidate phyla that get inserted deep into the tree, right? The safest assignment is via ancestors. I suggest investigating cousins and descendants only if there is a rational branch length threshold in place, which I think will be hard to fit well.

Consider the following example (image: https://user-images.githubusercontent.com/11960616/33106843-27b519d6-cee9-11e7-9366-7e736989d19a.png):

How would you limit the search in the tree for fragment "ins1_toD+E"? The strategy of "classify-otus" is to look down to the tips, not up to the ancestors. That is the strategy of "classify-paths".

wasade · 2017-11-22T02:39:46Z

Is this the LCA of the taxonomy for the proximal OTUs? Or just finding an OTU nearby? If the latter, there are a few scenarios where the false positive would be pretty bad. Has this method been benchmarked? Note that caching tip information across the tree would avoid full subtree traversals.
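
The caching idea mentioned at the end can be sketched as a single post-order pass that stores, for every node, the set of reference OTU tips below it. The dictionary-based tree and the function name `cache_otu_tips` are assumptions made for this example, not code from the PR.

```python
def cache_otu_tips(tree, root, otu_ids):
    """Map every node ID to the set of OTU tips below it, in one pass.

    tree: dict mapping node ID -> list of child IDs (tips map to [])
    Recursive for brevity; a very deep tree would need an iterative version.
    """
    cache = {}

    def visit(node):
        children = tree[node]
        if not children:
            cache[node] = {node} if node in otu_ids else set()
        else:
            # A node's OTU tips are the union of its children's OTU tips.
            cache[node] = set().union(*(visit(child) for child in children))
        return cache[node]

    visit(root)
    return cache


# ((frag1, otuA)n1, (otuB, otuC)n2)root
tree = {"root": ["n1", "n2"], "n1": ["frag1", "otuA"], "n2": ["otuB", "otuC"],
        "frag1": [], "otuA": [], "otuB": [], "otuC": []}
cache = cache_otu_tips(tree, "root", {"otuA", "otuB", "otuC"})
```

With such a cache, walking up from an inserted tip only needs one lookup per ancestor instead of a fresh subtree traversal.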

sjanssen2 · 2017-11-22T03:00:59Z

It is a hybrid. First, we try to find the closest OTUs by traversing the tree bottom-up from the inserted tip node until one or several equally good OTU tips are found. For all OTUs found, we look up the lineage strings and return their longest common prefix. Note that we don't operate on single characters but on rank names, i.e. everything separated by "; ".

It is not benchmarked, and I think a benchmark of taxonomic assignment performance would take a couple more weeks, if not months. But I would really like to push this plugin out soon. We could mark this function as experimental if you like.
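
The rank-wise longest common prefix described above could be computed as in this minimal sketch. The function name `common_lineage` and the example lineages are made up for illustration.

```python
def common_lineage(lineages, sep="; "):
    """Longest common prefix of lineage strings, compared rank by rank."""
    split = [lineage.split(sep) for lineage in lineages]
    prefix = []
    # zip stops at the shortest lineage, so ragged inputs are handled too.
    for ranks in zip(*split):
        if len(set(ranks)) > 1:
            break
        prefix.append(ranks[0])
    return sep.join(prefix)


lineages = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
]
shared = common_lineage(lineages)
```

Comparing whole rank names matters here: a character-wise prefix of the two example lineages would end in a dangling `; c__`, whereas the rank-wise version stops cleanly after the phylum.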

wasade · 2017-11-22T03:41:38Z

Sounds good thanks! I think denoting experimental explicitly would be helpful

sjanssen2 · 2017-11-22T04:46:50Z

OK. Do you happen to know how I properly flag this function as being experimental?

wasade · 2017-11-22T07:28:45Z

Change the method name so it has experimental in it? What I think would be helpful is for a QIIME2 user on the command line to have to type qiime fragment-insertion classify-otus-experimental ... to proceed, does that make sense?

sjanssen2 · 2017-11-22T17:02:31Z

@wasade @antgonza OK now?

sjanssen2 · 2017-11-22T19:13:58Z

ping @wasade

wasade · 2017-11-22T22:17:48Z

q2_fragment_insertion/plugin_setup.py

@@ -79,4 +74,70 @@
 )


+# text for readme, if we decide to put function 'classify_paths' back in:


Can these comments just be deleted?

Do you mean the whole block of commented-out code or just one specific line?

@wasade another ping :-)

wasade · 2017-11-22T22:18:35Z

One comment, 👍 otherwise

sjanssen2 added 19 commits November 17, 2017 11:49

added some test data trees

bba0775

registered function classify_paths

b840e76

added new function to qiime interface

dda87e0

moving taxonomy from 'sepp' into its own function 'classify-paths'.

cb2eab1

adding test data

5d1818a

renaming

cdb56a7

info on how to import taxonomies

efafa93

register second classfiy function

97e099e

package default taxonomy (Greengenes 99%) into plugin

61692e2

more tests

eb5a0b6

expose new function classify_otus

86e680e

coding classify_otus

d828111

computed with classify-otus not classify-paths

cd37517

added a section to explain the classify functions

9548403

added missing test file

247ff53

remove sequence

fabfc6c

flake8

8b7d104

Merge branch 'master' of https://github.com/biocore/q2-fragment-inser…

9f5abde

…tion into classify-functions Conflicts: README.md

Merge branch 'master' of https://github.com/biocore/q2-fragment-inser…

f2548e7

…tion into classify-functions

antgonza reviewed Nov 21, 2017

sjanssen2 added 8 commits November 21, 2017 08:28

Merge branch 'master' of https://github.com/biocore/q2-fragment-inser…

aed0b8c

…tion into classify-functions Conflicts: setup.py

addressing Antonio's comments

bb1979c

renamed

122d6e9

renaming

8780207

q

b413991

smaller try except block

4de7268

moved set_index because it will fail for empty dataframes

e633d74

fix empty dataframe

5b208eb

sjanssen2 added 2 commits November 21, 2017 15:45

not using os.path.join for resources, since they are NOT files

53d1eea

rather than just dumping error message to stderr, I now capture it an…

512782f

…d check for content in unittests

wasade requested changes Nov 22, 2017

sjanssen2 added 4 commits November 21, 2017 18:26

some cool edits suggested by Daniel

abac278

removed description of "classify-path" from readme as we agreed it is…

69b7589

… to experimental

commented out function registration for "classify-paths", but I want …

7b2cfa4

…to keep the code here

commented out tests for classify-paths

5c75825

sjanssen2 added 3 commits November 22, 2017 07:21

added an "experimental" label to function classify-otus

c5c5a25

bump version number

b985b4e

added extensive comments about tree traversal and lineage computation

25f8808

antgonza approved these changes Nov 22, 2017

wasade reviewed Nov 22, 2017

wasade approved these changes Nov 22, 2017

removed readme comments

0324829

sjanssen2 merged commit 2c59d6b into qiime2:master Nov 23, 2017

sjanssen2 deleted the classify-functions branch February 20, 2018 21:42

