Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a method to use a high performance implementation of UniFrac #143

Merged
merged 19 commits into from
Sep 26, 2017

Conversation

ElDeveloper
Copy link
Member

@ElDeveloper ElDeveloper commented Sep 20, 2017

This PR adds the necessary code to use the implementation of UniFrac in https://github.com/biocore/unifrac . This method is recommended for large datasets.

As a comparison, for a table with ~120 samples and a tree with 20K tips running time is: 2 minutes and 27 seconds with the current implementation, and 2 seconds with the new implementation. @wasade has been benchmarking this more thoroughly but, thought this served as a nice and small comparison.

As discussed at the developer's meeting, this method does not replace beta_phylogenetic, as it is still fairly new, instead it is provided as an alternative (beta_phylogenetic_hpc) that users can test out.

Currently, there's a few things missing in this PR:

@wasade
Copy link
Member

wasade commented Sep 20, 2017

Thanks, @ElDeveloper! One additional point, we include support for Generalized UniFrac and Variance Adjusted UniFrac. We also support meta UniFrac but it assumes variadic arguments support (qiime2/qiime2#234); it may be nice to put in a near term work around though.

@jairideout
Copy link
Member

Tests, it would make more sense to add a few files to the repo, rather than having to create those on the fly, but I am not sure if there's a preferred way to do this.

Take a look at some of the other QIIME 2 repos for examples of adding test data files -- we have a custom unittest.TestCase subclass that provides a helper for locating test data. I think q2-types is a good example -- let us know if you run into issues. Please try to keep the files small.

Citation information, it would be nice to include more information about the specific papers that should be cited for each metric. We previously had this when this was its own plugin, see here.

Citation support is currently limited to free text, so you can add whatever citations you want to the plugin using the citation_text field (I don't recommend putting citation info in the description field like you have in your linked example).

@ElDeveloper
Copy link
Member Author

ElDeveloper commented Sep 20, 2017 via email

@ElDeveloper
Copy link
Member Author

@jairideout, I've added the citation text and tests. 👍 This should be contingent on @wasade publishing a conda package, but is otherwise good to review.

@wasade
Copy link
Member

wasade commented Sep 20, 2017

@ElDeveloper, would it be possible to expose bypass_tips, a.k.a. fast mode? As a parameter description: "Compute will be reduced by ~50% with this option yielding an approximate result. The option disregards computing UniFrac at the tips of the phylogeny and instead only computes it on the internal nodes. In a bifurcating tree, the tips correspond to approximately 50% of the vertices in the tree, which is how the compute time reduction is achieved. The use of this parameter is analogous in concept to switching from 99% OTUs to 97% OTUs."

Thanks @mortonjt @wasade!

Add bypass-tips
@ElDeveloper ElDeveloper changed the title Adds a method to use a high performance implementation of UniFrac [WIP] Adds a method to use a high performance implementation of UniFrac Sep 21, 2017
@ElDeveloper
Copy link
Member Author

Tests are passing 🎉

@ebolyen
Copy link
Member

ebolyen commented Sep 23, 2017

Looks like it worked! We've got new environment files to test with 🎉!

@wasade
Copy link
Member

wasade commented Sep 23, 2017 via email

Copy link
Member

@ebolyen ebolyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, this is generally looking really good though! I'm going to pull this down and play with it a bit, but here's a few comments in the meanwhile.

@@ -63,6 +69,27 @@ def beta_phylogenetic(table: biom.Table, phylogeny: skbio.TreeNode,
return results


def beta_phylogenetic_hpc(table: BIOMV210Format, phylogeny: NewickFormat,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How you guys feel about something like: beta_phylogenetic_alt, then it may not be too much of a surprise if/when this gets merged into beta_phylogenetic.

We could also do something with unifrac like unifrac_variants or something, but given you are trying to roughly match the signature, it seems like an -alt suffix might work pretty well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion, changed to alt!

'https://doi.org/10.1186/1471-2105-12-118.'
'Generalized UniFrac: '
'Chen et al. 2012 Bioinformatics; DOI: '
'10.1093/bioinformatics/bts342')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have a convention here, but some newline characters might work well to separate these

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't see any changes with the new lines I added. ¯_(ツ)_/¯

'weighted_unnormalized_unifrac',
'weighted_normalized_unifrac',
'generalized_unifrac']),
'n_jobs': Int,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Int % Range(1, None)? What happens with negatives or zero?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negative numbers are not supported, so this is great!

'generalized_unifrac']),
'n_jobs': Int,
'variance_adjusted': Bool,
'alpha': Float, 'bypass_tips': Bool},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a newline so we can find bypass_tips in the future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this means, but done.

'metric': 'The beta diversity metric to be computed.',
'n_jobs': 'The number of workers to use.',
'variance_adjusted': ('Perform variance adjustment based on Chang et '
'al. BMC Bioinformatics 2011'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to briefly describe what that means?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an explanation, let me know if it makes sense, also CCing @wasade to keep him in the loop.

'correspond to approximately 50% of the vertices in '
'the tree, which is how the compute time reduction is'
' achieved. The use of this parameter is analogous '
'in concept to switching from 99% OTUs to 97% OTUs.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just say:

In a bifurcating tree, the tips make up about 50% of the nodes in a tree. By ignoring them, specificity can be traded for reduced compute time.

Or something like that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's much more concise, thanks.

'generalized_unifrac']),
'n_jobs': Int,
'variance_adjusted': Bool,
'alpha': Float, 'bypass_tips': Bool},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Float -> Float % Range(0, 1, inclusive_end=True)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I had previously tried Float % Range(0, 1) but that wasn't working, I'm glad this works.

"2012), Variance Adjusted UniFrac (Chang et al. 2011), "
"as well as Weighted normalized and unnormalized UniFrac "
"(Lozupone et al. 2007), unweighted UniFrac "
"(Lozupone et al. 2005), and metagenomic UniFrac "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think metagenomic UniFrac is exposed at the moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

elif metric == 'weighted_normalized_unifrac':
f = unifrac.weighted_normalized
elif metric == 'generalized_unifrac':
f = partial(unifrac.generalized, alpha=alpha)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but a dictionary lookup might be cleaner (that's usually what we do).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It made things cleaner, though we needed a bit of special casing though 👍

def beta_phylogenetic_hpc(table: BIOMV210Format, phylogeny: NewickFormat,
metric: str, n_jobs: int=1,
variance_adjusted: bool=False,
alpha=1.0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong recommendation, but it might be nice to error if alpha is passed when the metric isn't generalized UniFrac. That would require you set the default to something like None though, which definitely obfuscates things a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed your suggestion here.

@@ -63,6 +76,31 @@ def beta_phylogenetic(table: biom.Table, phylogeny: skbio.TreeNode,
return results


def beta_phylogenetic_alt(table: BIOMV210Format, phylogeny: NewickFormat,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to politely catch and err if the table has < 4 samples? That's an edge case we haven't handled yet behind the scenes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that this is the reason why we've observed failures, for all the unit tests we compute distances on a 2x2 sample matrix, and those tests pass.

)

plugin.methods.register_function(
function=q2_diversity.beta_phylogenetic,
inputs={'table': FeatureTable[Frequency],
'phylogeny': Phylogeny[Rooted]},
parameters={'metric': Str % Choices(beta.phylogenetic_metrics()),
'n_jobs': Int},
'n_jobs': Int % Range(1, None)},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is minor, and ultimately whatever fits better into the surrounding terminology is preferred, but it is more correct to describe this as threads.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jobs is probably better for now, since it sounds like this is actually CPU threads, and not a process thread.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's process threads (i.e., pthread) that we bind to cores to reduce thread hopping and to leverage NUMA's default local allocation policy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I wrote this, I knew it was misleading, but carried on since it matches beta_phylogenetic. Would probably be nice to have a standard designation for "workers" all throughout QIIME.

'n_jobs': 'The number of workers to use.',
'variance_adjusted': ('Perform variance adjustment based on Chang et '
'al. BMC Bioinformatics 2011. Weights distances '
'by the number of tips evaluated for each '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It weights based on the proportion of the relative abundance represented between the samples at a given node under evaluation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

' proportions are dichotomized.'),
'bypass_tips': ('In a bifurcating tree, the tips make up about 50% of '
'the nodes in a tree. By ignoring them, specificity '
'can be traded for reduced compute time.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add: "time. This has the effect of collapsing the phylogeny, and is analogous in concept to moving from 99% to 97% OTUs"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@ebolyen ebolyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more comments


if alpha is not None and metric != 'generalized_unifrac':
raise ValueError('The alpha parameter is only allowed when the choice'
' of metric is generalized_unifrac')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "opposite" of this causes unifrac to raise an error, but the user is too far removed to be able to pinpoint the cause.

So we should add a branch to either set alpha=1 when alpha == None or to explain that alpha needs to be set. Either should work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!!! 👏

f = metrics[metric]

# unifrac processes tables and trees should be filenames
return f(str(table), str(phylogeny), threads=n_jobs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when n_jobs is greater than logical cores, we get a segfault (when we're lucky). We should just guard that value to be at or less than cores present on the host.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

psutil to the rescure 💪

@@ -32,6 +32,7 @@ requirements:
- ipywidgets
- scikit-bio
- scikit-learn
- psutilt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra t

@ebolyen ebolyen merged commit dd1322b into qiime2:master Sep 26, 2017
@ElDeveloper
Copy link
Member Author

Thanks for fixing @ebolyen!

@wasade
Copy link
Member

wasade commented Sep 26, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants