[WIP] Add reference based alignment using SINA (fixes #53) #54

epruesse · 2018-09-11T04:53:47Z

thermokarst · 2018-09-20T21:03:23Z

.travis.yml

@@ -13,6 +13,8 @@ before_install:
 install:
  - wget -q https://raw.githubusercontent.com/qiime2/environment-files/master/latest/staging/qiime2-latest-py35-linux-conda.yml
  - conda env create -q -n test-env --file qiime2-latest-py35-linux-conda.yml
+  # update default install with local recipe:
+  - conda env update -q -n test-env --file ci/recipe/meta.yaml


Once you drop this we can re-run travis.

I was waiting for environment-files/staged to be updated to push the revert. Nice. Let's see...

gregcaporaso · 2018-09-27T17:20:39Z

Thanks @epruesse! I'm going to take one more pass through this - just need a day or two. The easiest path forward for docs would be to post a community tutorial on the forum. Do you want to start there?

gregcaporaso · 2018-09-27T20:27:16Z

@epruesse, can you point me at some reference files both in qza and arb formats? I'd like to test this out locally. (You'll want to have one or both of those in your docs too, so these can be the same files.)

epruesse · 2018-09-27T22:37:30Z

> Thanks @epruesse! I'm going to take one more pass through this - just need a day or two. The easiest path forward for docs would be to post a community tutorial on the forum. Do you want to start there?

Yes, I'll do that. Thinking about a full workflow, not just alignment, I'd probably need to add pplacer or an raxml-add-to-tree Function, so I think I'll stick to just the alignment for now.

> @epruesse, can you point me at some reference files both in qza and arb formats? I'd like to test this out locally. (You'll want to have one or both of those in your docs too, so these can be the same files.)

@gregcaporaso Thank you! Some testing outside of my computers is very welcome!

It may not be the best demo case, but this is the ARB file I use for testing: https://github.com/epruesse/sina_data/raw/master/ltp_reduced.arb.xz It's part of the LTP dataset. If you have enough memory (32G probably), the current SILVA ARB files will work as well.

I'll comment here once I have the tutorial ready.

epruesse · 2018-09-28T03:09:05Z

Ok, I've posted the tutorial. It contains some examples. I hope the text made it - after posting it disappeared for moderator confirmation, but I can't see the post at all myself...

thermokarst · 2018-11-06T20:48:17Z

Following up, this doesn't appear to be a problem with the QZA-based reference, probably because SINA writes these temp files wherever the reference ARB is. In the case of the QZA, that is in a temp dir (ref).

epruesse · 2018-11-06T21:27:26Z

> The four LTPs132_SSU.arb.index.* files are side-effect files that we will probably get a ton of questions about on the QIIME 2 Forum; it would be ideal to write these to a tmp dir instead. Thoughts?

Those are intentional. Think of them as .1.bt2, .2.bt2, etc. They are the search index generated by the ARB PT server.

In detail:

- `*index.arb` is a reduced database, containing no meta data and no alignment (improves startup time and memory footprint of PT server vs using the full DB)
- `*index.arb.pt` is the suffix trie itself
- `*index.ARF` is used for database layers (not relevant in this case, but created by ARB anyway)
- `*index.ARM` is a memory image of the reduced database (improves startup time and allows using shared memory)

When working with QZAs, there is no way for SINA to keep them around, so they are deleted with the temp folder. When working with "external" ARB reference databases, the whole point is to not spend an hour building those each time you want to run SINA.
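As a rough illustration of the caching behavior described above, the following sketch checks whether the four index files already sit next to an external ARB reference. The naming scheme `<reference>.index.<suffix>` is an assumption inferred from the `LTPs132_SSU.arb.index.*` names mentioned in this thread, and `expected_index_files` and `index_is_cached` are hypothetical helpers, not part of SINA or QIIME 2.

```python
from pathlib import Path

# Suffixes taken from the list above. The "<reference>.index.<suffix>"
# naming scheme is an assumption inferred from the file names quoted
# in this thread, not a documented SINA/ARB guarantee.
INDEX_SUFFIXES = ("arb", "arb.pt", "ARF", "ARM")


def expected_index_files(reference):
    """Return the index paths expected next to the ARB file `reference`."""
    reference = Path(reference)
    return [
        reference.with_name(f"{reference.name}.index.{suffix}")
        for suffix in INDEX_SUFFIXES
    ]


def index_is_cached(reference):
    """Return True if every expected index file already exists on disk."""
    return all(path.exists() for path in expected_index_files(reference))
```

A caller could use `index_is_cached("~/databases/silva132/some.arb")` (after expanding the path) to warn the user that the first run will include a long index build.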

thermokarst · 2018-11-06T21:32:03Z

Thanks @epruesse!

> there is no way for SINA to keep them around

> When working with "external" ARB reference databases, the whole point is to not spend an hour building those each time you want to run SINA

Sounds like these are all very compelling reasons for establishing a QIIME 2 Format, that way these files can:

a) be reused
b) be good citizens of the QIIME 2 ecosystem

I am happy to lend a hand with setting up a format (or formats). I proposed the tempfile solution, since you are already doing that with the QZA-variant, and it is arguably less development effort. Either way, I don't think I am comfortable leaving this "side-effect" behavior in a "core" QIIME 2 plugin --- I would like to see this resolved one way or the other before merging.

epruesse · 2018-11-07T01:25:59Z

@thermokarst Let's look at what I'm guessing would be a typical user story: "I want to align my 16S amplicons using the SILVA Ref NR 132 reference alignment".

The zipped ARB file is about 300MB and contains about 600k sequences. The way it is now, the user unpacks that ARB file in a useful location, say ~/databases/silva132. The first time the user runs qiime alignment sina, there will be a 4-hour index build generating some 10GB of index data. On every subsequent run, those files will simply be mapped into memory, so load time is minimal. Because they are the same files, parallel runs will share most of their memory.

OTOH, if we put those 5 files into a QZA, the QZA will be ~5GB, which needs to be unpacked for every invocation. Load times will be higher as the files are "new" and not already in the Linux buffer cache. Parallel runs won't be able to share memory as each will have its own unpacked copy, and will therefore be impossible on computers with less than 32GB. The only gain would be having four fewer files lying about, and being able to track which ARB file was used.

Since the latter is probably the only thing of real value to the user, I'd prefer to just log the ARB file parameter in the output QZA. Identifying it as input by name, timestamp, size and hash should be sufficient to satisfy provenance tracking.

If you guys can't stand the idea, I think it best to wait until I have a version of SINA that works without the PT server. It won't have exactly the same behavior, but using a simple inverted index the build time would be low enough to live without an on-disk cache of the index.

I just don't see that stashing index files in QZAs fits well with what they were designed to do.
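The provenance suggestion above (identifying the ARB file by name, timestamp, size and hash) could look roughly like the sketch below. `describe_reference` is a hypothetical helper, the choice of SHA-256 is an assumption, and how such a record would be attached to an output QZA is left open; this is not an existing QIIME 2 API.

```python
import hashlib
import os
from datetime import datetime, timezone


def describe_reference(path, chunk_size=1 << 20):
    """Summarize an external reference file by name, size, mtime and hash."""
    digest = hashlib.sha256()
    # Read in chunks so multi-gigabyte ARB files are never loaded at once.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    stat = os.stat(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "sha256": digest.hexdigest(),
    }
```

Hashing a large ARB file adds some I/O on every run, so a real implementation might prefer to cache the digest alongside the size and timestamp.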

thermokarst · 2018-11-07T17:49:49Z

Hey there @epruesse! Thanks so much for the detailed response there, I really appreciate it. Without going into too much detail, I agree 100% with what you have outlined here, and this definitely emphasizes several long-tracked issues in the QIIME 2 Framework. I had a chance to chat offline with @gregcaporaso, and we were thinking that organizing a call with the three of us would be beneficial (and would probably streamline the discussion, rather than via GH Issues). If @ebolyen was in this hemisphere I would invite him too. Would you be available for a videocall this Friday?

epruesse · 2018-11-09T16:44:04Z

@thermokarst Just saw this - sent mail to Greg. If you guys have time, I'll be available most of the day.

epruesse · 2019-04-26T00:16:34Z

@ebolyen @thermokarst - how do we proceed here?

I'm going to push out SINA 1.6.0 today. With that, SINA will switch to the new internal search engine by default. Bit of a big update. Hope it goes well.

With the internal engine, I index the 700MB / 700k sequence SILVA SSU RefNR99 in about 4 minutes, generating a 300MB index file (250MB zipped). Having an index to keep around would still be neat. 4 minutes can be long. But it isn't a requirement any longer. So if you want, I can take the external option out.

ebolyen · 2019-11-08T23:51:43Z

Hey @epruesse, sorry we missed your pings. From the looks of the project log, we've been shuffling this issue around for a bit.

Re: index, if that's the case, then perhaps we can go with a zipped index (for now), and then work out the details of a proper caching strategy.

I'm still in favor of a user managed index/cache, where we basically have the artifacts live in an unzipped form with some tools to see how much space is being used and by what.
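To make the "how much space is being used and by what" idea concrete, here is a minimal sketch that sums disk usage per top-level entry of a cache directory. The cache layout (one entry per unzipped artifact) and the `cache_usage` helper are assumptions for illustration only; no such QIIME 2 tool is described in this thread.

```python
from pathlib import Path


def cache_usage(cache_dir):
    """Map each top-level entry in `cache_dir` to its total size in bytes."""
    usage = {}
    for entry in Path(cache_dir).iterdir():
        if entry.is_dir():
            # Sum every regular file below this entry, however deeply nested.
            usage[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
        else:
            usage[entry.name] = entry.stat().st_size
    return usage
```

Sorting the result by value would give the user a quick view of which cached artifacts are worth deleting first.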

ebolyen · 2019-11-08T23:56:14Z

Found the forum discussion on this

jwdebelius · 2021-08-11T22:39:26Z

Can I ping on this issue?

epruesse · 2021-08-30T18:00:38Z

Yes - I have to get around to fixing this. Ping me again in a week? My current position doesn't leave me much time for this, so progress has stalled.

lizgehret · 2022-09-15T17:37:32Z

Hi @epruesse! We are currently doing some PR triage and review right now - is this something you're still working on and have the bandwidth for? Or should we close out this PR?

epruesse · 2022-09-15T21:27:28Z

Can we keep it open so I don't forget it? I still haven't given up hope I might yet be able to complete this. We could make it an issue instead; I'm sure the PR needs major rework by now.

ebolyen · 2022-09-21T23:00:59Z

No problem @epruesse. It does look like there's already an issue, so I'm going to give it a bump and link back to this PR :)

epruesse and others added 3 commits August 27, 2018 12:19

Add SINA - naive untested code

b4d48de

First functional

088aa42

Merge branch 'master' into master

0e332aa

epruesse mentioned this pull request Sep 11, 2018

add SINA reference based alignment #53

Open

2 tasks

epruesse added 4 commits September 10, 2018 23:15

Expose kmer-length

c6054f1

Add unit test

dfd1f76

Document tests

a623db7

Move duplicate fasta id check to function

4faa33e

epruesse force-pushed the master branch from 75f6fc0 to 9b1ff91 Compare September 11, 2018 22:20

Add num_references parameter to SINA

bfa852f

epruesse force-pushed the master branch from 9b1ff91 to bfa852f Compare September 11, 2018 22:20

epruesse added 2 commits September 15, 2018 17:01

Placate q2lint

ed1aef1

Fix linter errors

1752c0f

epruesse mentioned this pull request Sep 16, 2018

CI does not honor ci/recipe/meta.yaml for PRs #55

Closed

epruesse added 2 commits September 16, 2018 15:36

Remove additional copyright claim

a359d34

Try working around ci/recipe/meta.yaml ignored (see qiime2#55)

ae5a17f

epruesse mentioned this pull request Sep 17, 2018

Add sina to requirements #56

Merged

thermokarst reviewed Sep 20, 2018

epruesse and others added 4 commits September 20, 2018 18:15

Revert ae5a17f

4a4caa1

Merge branch 'master' into master

0f8d6cb

Don't use (cryptic temp) filename in messages

12d2760

Fix missing empty line

8c54e9b

thermokarst assigned gregcaporaso Sep 27, 2018

thermokarst requested review from gregcaporaso and thermokarst September 27, 2018 22:09

thermokarst removed their assignment Nov 6, 2018

gregcaporaso mentioned this pull request Nov 6, 2018

DNAFASTAFormat and AlignedDNAFASTAFormat should check for duplicate ids during full validation qiime2/q2-types#198

Open

gregcaporaso mentioned this pull request Nov 13, 2018

add document covering pros and cons of creating a plugin versus contributing to an existing one qiime2/dev-docs#27

Closed

ebolyen closed this Sep 21, 2022
