-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add reference based alignment using SINA (fixes #53) #54
Conversation
.travis.yml
Outdated
@@ -13,6 +13,8 @@ before_install: | |||
install: | |||
- wget -q https://raw.githubusercontent.com/qiime2/environment-files/master/latest/staging/qiime2-latest-py35-linux-conda.yml | |||
- conda env create -q -n test-env --file qiime2-latest-py35-linux-conda.yml | |||
# update default install with local recipe: | |||
- conda env update -q -n test-env --file ci/recipe/meta.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once you drop this we can re-run travis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was waiting for environment-files/staged to be updated to push the revert. Nice. Let's see...
Thanks @epruesse! I'm going to take one more pass through this - just need a day or two. The easiest path forward for docs would be to post a community tutorial on the forum. Do you want to start there? |
@epruesse, can you point me at some reference files both in qza and arb formats? I'd like to test this out locally. (You'll want to have one or both of those in your docs too, so these can be the same files.) |
Yes, I'll do that. Thinking about a full workflow, not just alignment, I'd probably need to add
@gregcaporaso Thank you! Some testing outside of my computers is very welcome! It may not be the best demo case, but this is the ARB file I use for testing: https://github.com/epruesse/sina_data/raw/master/ltp_reduced.arb.xz It's part of the LTP dataset. If you have enough memory (32G probably), the current SILVA ARB files will work as well. I'll comment here once I have the tutorial ready. |
Ok, I've posted the tutorial. It contains some examples. I hope the text made it - after posting it disappeared for moderator confirmation, but I can't see the post at all myself... |
Following up, this doesn't appear to be a problem with the QZA-based reference, probably because SINA writes these temp files whereever the reference ARB is. In the case of the QZA, that is in a temp dir (ref). |
Those are intentional. Think of them as In detail:
When working with QZAs, there is no way for SINA to keep them around, so they are deleted with the temp folder. When working with "external" ARB reference databases, the whole point is to not spend an hour building those each time you want to run SINA. |
Thanks @epruesse!
Sounds like these are all very compelling reasons for establishing a QIIME 2 Format, that way these files can: a) be reused I am happy to lend a hand with setting up a format (or formats). I proposed the tempfile solution, since you are already doing that with the QZA-variant, and it is arguably less development effort. Either way, I don't think I am comfortable leaving this "side-effect" behavior in a "core" QIIME 2 plugin --- I would like to see this resolved one way or the other before merging. |
@thermokarst Let's look at what I'm guessing would be a typical user story: "I want to align my 16S amplicons using the SILVA Ref NR 132 reference alignment". The zipped ARB file is about 300MB and contains about 600k sequences. The way it is now, the user unpacks that ARB file in a useful location, say OTOH, if we put those 5 files into a QZA, the QZA will be ~5GB which need to be unpacked for every invocation. Load times will be higher as the files are "new" and not already in the Linux buffer cache. Parallel runs won't be able to share memory as each will have it's own unpacked copy, and therefore impossible on computers with less than 32GB. The only gain would be having four less files lying about, and being able to track which ARB file was used. Since the latter is probably the only thing of real value to the user, I'd prefer to just log the ARB file parameter in the output QZA. Identifying it as input by name, timestamp, size and hash should be sufficient to satisfy provenance tracking. If you guys can't stand the idea, I think it best to wait until I have a version of SINA that works without the PT server. It won't have exactly the same behavior, but using a simple inverted index the build time would be low enough to live without an on-disk cache of the index. I just don't see that stashing index files in QZAs fits well with what they were designed to do. |
Hey there @epruesse! Thanks so much for the detailed response there, I really appreciate it. Without going into too much detail, I agree 100% with what you have outlined here, and this definitely emphasizes several long-tracked issues in the QIIME 2 Framework. I had a chance to chat offline with @gregcaporaso, and we were thinking that organizing a call with with the three of us would be beneficial (and would probably streamline the discussion, rather than via GH Issues). If @ebolyen was in this hemisphere I would invite him too. Would you be available for a videocall this Friday? |
@thermokarst Just saw this - sent mail to Greg. If you guys have time, I'll be available most of the day. |
@ebolyen @thermokarst - how do we proceed here? I'm going to push out SINA 1.6.0 today. With that, SINA will switch to the new internal search engine by default. Bit of a big update. Hope it goes well. With the internal engine, I index the 700MB / 700k sequence SILVA SSU RefNR99 in about 4 minute generating a 300MB index file (250MB zipped). Having an index to keep around would still be neat. 4 minutes can be long. But it isn't a requirement any longer. So if you want, I can take the external option out. |
Hey @epruesse, sorry we missed your pings. From the looks of the project log, we've been shuffling this issue around for a bit. Re: index, if that's the case, then perhaps we can go with a zipped index (for now), and then work out the details of a proper caching strategy. I'm still in favor of a user managed index/cache, where we basically have the artifacts live in an unzipped form with some tools to see how much space is being used and by what. |
Can I ping on this issue? |
Yes - I have to get around to fixing this. Ping me again in a week? My current position doesn't leave me much time for this, so progress stalled |
Hi @epruesse! We are currently doing some PR triage and review right now - is this something you're still working on and have the bandwidth for? Or should we close out this PR? |
Can we keep it open so I don't forget it? I still haven't given up hope I might yet be able to complete this. Could make it an issue instead, I'm sure the PR needs major rework by now. |
No problem @epruesse. It does look like there's already an issue, so I'm going to give it a bump and link back this this PR :) |
See #53