Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add reference based alignment using SINA (fixes #53) #54

Closed
wants to merge 22 commits into from

Conversation

epruesse
Copy link
Member

@epruesse epruesse commented Sep 11, 2018

See #53

@epruesse epruesse mentioned this pull request Sep 17, 2018
.travis.yml Outdated
@@ -13,6 +13,8 @@ before_install:
install:
- wget -q https://raw.githubusercontent.com/qiime2/environment-files/master/latest/staging/qiime2-latest-py35-linux-conda.yml
- conda env create -q -n test-env --file qiime2-latest-py35-linux-conda.yml
# update default install with local recipe:
- conda env update -q -n test-env --file ci/recipe/meta.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you drop this we can re-run travis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was waiting for environment-files/staged to be updated to push the revert. Nice. Let's see...

@gregcaporaso
Copy link
Member

gregcaporaso commented Sep 27, 2018

Thanks @epruesse! I'm going to take one more pass through this - just need a day or two. The easiest path forward for docs would be to post a community tutorial on the forum. Do you want to start there?

@gregcaporaso
Copy link
Member

gregcaporaso commented Sep 27, 2018

@epruesse, can you point me at some reference files both in qza and arb formats? I'd like to test this out locally. (You'll want to have one or both of those in your docs too, so these can be the same files.)

@epruesse
Copy link
Member Author

Thanks @epruesse! I'm going to take one more pass through this - just need a day or two. The easiest path forward for docs would be to post a community tutorial on the forum. Do you want to start there?

Yes, I'll do that. Thinking about a full workflow, not just alignment, I'd probably need to add pplacer or an raxml-add-to-tree Function, so I think I'll stick to just the alignment for now.

@epruesse, can you point me at some reference files both in qza and arb formats? I'd like to test this out locally. (You'll want to have one or both of those in your docs too, so these can be the same files.)

@gregcaporaso Thank you! Some testing outside of my computers is very welcome!

It may not be the best demo case, but this is the ARB file I use for testing: https://github.com/epruesse/sina_data/raw/master/ltp_reduced.arb.xz It's part of the LTP dataset. If you have enough memory (32G probably), the current SILVA ARB files will work as well.

I'll comment here once I have the tutorial ready.

@epruesse
Copy link
Member Author

Ok, I've posted the tutorial. It contains some examples. I hope the text made it - after posting it disappeared for moderator confirmation, but I can't see the post at all myself...

@thermokarst thermokarst removed their assignment Nov 6, 2018
@thermokarst
Copy link
Contributor

Following up, this doesn't appear to be a problem with the QZA-based reference, probably because SINA writes these temp files whereever the reference ARB is. In the case of the QZA, that is in a temp dir (ref).

@epruesse
Copy link
Member Author

epruesse commented Nov 6, 2018

The four LTPs132_SSU.arb.index.* files are side-effect files that we will probably get a ton of questions about on the QIIME 2 Forum, it would be ideal to write these to a tmp dir, instead. Thoughts?

Those are intentional. Think of them as .1.bt2, .2.bt2, etc. They are the search index generated by the ARB PT server.

In detail:

  • *index.arb is a reduced database, containing no meta data and no alignment (improves startup time and memory footprint of PT server vs using the full DB)
  • *index.arb.pt is the suffix trie itself
  • *index.ARF is used for database layers (not relevant in this case, but created by ARB anyway)
  • *index.ARM is a memory image of the reduced database (improves startup time and allows using shared memory)

When working with QZAs, there is no way for SINA to keep them around, so they are deleted with the temp folder. When working with "external" ARB reference databases, the whole point is to not spend an hour building those each time you want to run SINA.

@thermokarst
Copy link
Contributor

Thanks @epruesse!

here is no way for SINA to keep them around

When working with "external" ARB reference databases, the whole point is to not spend an hour building those each time you want to run SINA

Sounds like these are all very compelling reasons for establishing a QIIME 2 Format, that way these files can:

a) be reused
b) be good citizens of the QIIME 2 ecosystem

I am happy to lend a hand with setting up a format (or formats). I proposed the tempfile solution, since you are already doing that with the QZA-variant, and it is arguably less development effort. Either way, I don't think I am comfortable leaving this "side-effect" behavior in a "core" QIIME 2 plugin --- I would like to see this resolved one way or the other before merging.

@epruesse
Copy link
Member Author

epruesse commented Nov 7, 2018

@thermokarst Let's look at what I'm guessing would be a typical user story: "I want to align my 16S amplicons using the SILVA Ref NR 132 reference alignment".

The zipped ARB file is about 300MB and contains about 600k sequences. The way it is now, the user unpacks that ARB file in a useful location, say ~/databases/silva132. The first time the user runs qiime alignment sina, there will be a 4 hour index build generating some 10GB of index data. On every subsequent run, those files will simply be mapped into memory, so cause very little load time. Because it's the same files, parallel runs will share most of their memory requirements.

OTOH, if we put those 5 files into a QZA, the QZA will be ~5GB which need to be unpacked for every invocation. Load times will be higher as the files are "new" and not already in the Linux buffer cache. Parallel runs won't be able to share memory as each will have it's own unpacked copy, and therefore impossible on computers with less than 32GB. The only gain would be having four less files lying about, and being able to track which ARB file was used.

Since the latter is probably the only thing of real value to the user, I'd prefer to just log the ARB file parameter in the output QZA. Identifying it as input by name, timestamp, size and hash should be sufficient to satisfy provenance tracking.

If you guys can't stand the idea, I think it best to wait until I have a version of SINA that works without the PT server. It won't have exactly the same behavior, but using a simple inverted index the build time would be low enough to live without an on-disk cache of the index.

I just don't see that stashing index files in QZAs fits well with what they were designed to do.

@thermokarst
Copy link
Contributor

thermokarst commented Nov 7, 2018

Hey there @epruesse! Thanks so much for the detailed response there, I really appreciate it. Without going into too much detail, I agree 100% with what you have outlined here, and this definitely emphasizes several long-tracked issues in the QIIME 2 Framework. I had a chance to chat offline with @gregcaporaso, and we were thinking that organizing a call with with the three of us would be beneficial (and would probably streamline the discussion, rather than via GH Issues). If @ebolyen was in this hemisphere I would invite him too. Would you be available for a videocall this Friday?

@epruesse
Copy link
Member Author

epruesse commented Nov 9, 2018

@thermokarst Just saw this - sent mail to Greg. If you guys have time, I'll be available most of the day.

@epruesse
Copy link
Member Author

@ebolyen @thermokarst - how do we proceed here?

I'm going to push out SINA 1.6.0 today. With that, SINA will switch to the new internal search engine by default. Bit of a big update. Hope it goes well.

With the internal engine, I index the 700MB / 700k sequence SILVA SSU RefNR99 in about 4 minute generating a 300MB index file (250MB zipped). Having an index to keep around would still be neat. 4 minutes can be long. But it isn't a requirement any longer. So if you want, I can take the external option out.

@ebolyen
Copy link
Member

ebolyen commented Nov 8, 2019

Hey @epruesse, sorry we missed your pings. From the looks of the project log, we've been shuffling this issue around for a bit.

Re: index, if that's the case, then perhaps we can go with a zipped index (for now), and then work out the details of a proper caching strategy.

I'm still in favor of a user managed index/cache, where we basically have the artifacts live in an unzipped form with some tools to see how much space is being used and by what.

@ebolyen
Copy link
Member

ebolyen commented Nov 8, 2019

@jwdebelius
Copy link

Can I ping on this issue?

@epruesse
Copy link
Member Author

Yes - I have to get around to fixing this. Ping me again in a week? My current position doesn't leave me much time for this, so progress stalled

@lizgehret
Copy link
Member

Hi @epruesse! We are currently doing some PR triage and review right now - is this something you're still working on and have the bandwidth for? Or should we close out this PR?

@epruesse
Copy link
Member Author

Can we keep it open so I don't forget it? I still haven't given up hope I might yet be able to complete this. Could make it an issue instead, I'm sure the PR needs major rework by now.

@ebolyen
Copy link
Member

ebolyen commented Sep 21, 2022

No problem @epruesse. It does look like there's already an issue, so I'm going to give it a bump and link back this this PR :)

@ebolyen ebolyen closed this Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants