Implemented a different version of the multidog parallel loop #17

Open
wants to merge 1 commit into
base: master


alethere

Hi David,

I contacted you some time ago (November last year) suggesting a different approach to multidog: writing results to files instead of returning a data.frame. I see that since then you've switched to the future package for managing parallelization. I have not implemented the algorithm using future, but I imagine it would work the same way.

My test on 100K SNPs shows the following time usage:
Function athe_small took 1.66 h
Function athe_all took 1.94 h
Function multidog took 3.15 h

Here athe_small is multidog writing only the SNP parameters (things like prop_mis that have one estimate per marker) and the genotypes; athe_all writes all possible outputs to different tables; and multidog is the original implementation.

As you can see, the improvement in run time is relatively modest (roughly 1.6x to 1.9x faster). I suspect memory usage is better too, since that's what I found when running it on my own computer, although I couldn't confirm it on the computer cluster where I ran the test above (reading memory usage turned out to be more complicated than I anticipated).

Small overview of the function changes:

  • Output is written to multiple tables instead of being returned as a data.frame. This is done by writing in parallel to a single file, in groups of 100 markers. This produces a few corrupted lines (~0.2% of lines, ~0.4% of markers), which are removed. Writing results to files instead of holding them in memory while the loop runs improves memory usage substantially (this should be tested again).
  • The user can define the desired output; less output means faster computation.
  • The multidog object class and multidog plots are no longer available. A "multidog builder" could be implemented to create a multidog object from the tables, which would allow the use of plot_multidog().
  • The "future" package has not been used for parallelization.
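
To make the "corrupted lines are eliminated" step concrete, here is a minimal, language-agnostic sketch in Python (not the R code in this PR). It assumes corruption shows up as a row with the wrong number of fields, which is only one plausible check; the function name, column names, and sample values are all hypothetical.

```python
import csv
import io


def drop_corrupted_rows(lines, expected_fields, delimiter=","):
    """Keep rows with the expected field count; count the rest as corrupted.

    Assumption: a line damaged by interleaved parallel writes ends up with
    too many or too few fields. A damaged line that happens to have the
    right field count would NOT be caught by this check.
    """
    kept = []
    dropped = 0
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) == expected_fields:
            kept.append(row)
        else:
            dropped += 1
    return kept, dropped


# Hypothetical per-marker table; the third line mimics two writers interleaving.
raw = io.StringIO(
    "snp,prop_mis,bias\n"
    "snp1,0.01,0.98\n"
    "snp2,0.02,snp3,0.00,1.10\n"
    "snp4,0.03,1.02\n"
)
header, *body = raw.read().splitlines()
rows, n_bad = drop_corrupted_rows(body, expected_fields=len(header.split(",")))
```

A stricter variant could also validate each field's type (for example, that numeric columns parse as numbers), at the cost of some extra time per line.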

Let's see what you think.

Cheers,
Alejandro

PS: Sorry for the delay with submitting, some other research got in the way.

@dcgerard
Owner

dcgerard commented Mar 5, 2021

Hey @alethere, thanks so much for doing all of this!

I just want to pop in real quick and say that I've been really busy, so haven't had a chance to check things out. I'll get around to looking at the changes.

One quick comment: Maybe it would be better to create a new function, rather than replace the multidog() function. I love that your method takes less time, but some use-cases would work better without any corrupted lines. So we could have a new function, say parwdog() (for parallel writing updog), that a user could use for speed improvements at the cost of possible line corruption? Let me know what you think.

2 participants