Implemented a different version of the multidog parallel loop #17

alethere · 2021-02-25T13:00:15Z

Hi David,

I contacted you some time ago (November last year) suggesting a different approach to multidog: writing results to files instead of returning a data.frame. I see that since then you've switched to the future package to manage parallelization. I have not implemented the algorithm using future, but I imagine it would work the same way.

My test on 100K SNPs shows the following time usage:
Function athe_small took 1.66 h
Function athe_all took 1.94 h
Function multidog took 3.15 h

Here athe_small is multidog writing only the SNP-level parameters (things like prop_mis that have one estimate per marker) and the genotypes; athe_all writes all possible outputs to separate tables; and multidog is the original implementation.

As you can see, the improvement in run time is relatively small. I suspect memory usage is better, since that is what I found when running it on my own computer, but I couldn't confirm it on the computer cluster where I ran the test above (reading memory usage turned out to be more complicated than I anticipated).

Small overview of the function changes:

Output is written to multiple tables instead of being returned as a data.frame. This is achieved by writing in parallel into one file, in groups of 100 markers. This creates a few corrupted lines (~0.2% of lines, ~0.4% of markers), which are removed. Writing results to files while the loop is running, instead of holding them in memory, improves memory usage substantially (I should test this again).
The user can choose the desired output; less output means faster computation.
The multidog object class and the multidog plots are no longer available. A "multidog builder" could be implemented to create a multidog object from the tables, which would allow the use of plot_multidog().
The "future" package has not been used for parallelization.
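
To illustrate the clean-up step mentioned above, here is a minimal base R sketch of how corrupted lines could be dropped from a tab-separated output table. It is not the code in this PR: the helper name drop_corrupted_lines(), the column names, and the rule "a valid line has exactly the expected number of fields" are all assumptions made for the example.

```r
# Hypothetical helper (not part of updog or this PR): keep only the lines
# of a delimited file that have the expected number of fields.
drop_corrupted_lines <- function(path, n_fields, sep = "\t") {
  lines <- readLines(path, warn = FALSE)
  # Count the fields on each line by splitting on the separator.
  n <- lengths(strsplit(lines, sep, fixed = TRUE))
  good <- lines[n == n_fields]
  # Parse the surviving lines; the first one is assumed to be the header.
  read.table(text = good, sep = sep, header = TRUE, stringsAsFactors = FALSE)
}

# Toy file imitating what interleaved parallel writes might leave behind.
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "snp\tprop_mis\tbias",
  "snp1\t0.01\t0.9",
  "snp2\t0.02snp3\t0.03\t1.1",  # two writes interleaved: 4 fields
  "snp4\t0.00",                 # truncated write: 2 fields
  "snp5\t0.05\t1.0"
), tmp)

clean <- drop_corrupted_lines(tmp, n_fields = 3)
print(clean)
```

On this toy file only snp1 and snp5 survive. A field-count check like this cannot catch a corrupted line that happens to have the right number of fields, so a stricter filter (for example, validating column types) may be needed in practice.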

Let's see what you think.

Cheers,
Alejandro

PS: Sorry for the delay with submitting, some other research got in the way.

dcgerard · 2021-03-05T13:47:21Z

Hey @alethere, thanks so much for doing all of this!

I just want to pop in real quick and say that I've been really busy, so haven't had a chance to check things out. I'll get around to looking at the changes.

One quick comment: maybe it would be better to create a new function rather than replace the multidog() function. I love that your method takes less time, but some use cases would work better without any corrupted lines. So we could have a new function, say parwdog() (for parallel writing updog), that a user could choose for the speed improvement at the cost of possible line corruption. Let me know what you think.

Implemented a different version of the multidog parallel loop

7686681
