
Parallelization #4

Open
ruiye88 opened this issue Sep 8, 2024 · 3 comments
ruiye88 commented Sep 8, 2024

Hi,

Thank you for developing this wonderful tool. Just curious, are you planning to add a parallel computing option for this function?

Rui


pcarbo commented Sep 8, 2024

Thanks for your interest @ruiye88. Have you tried the current implementation on your data set? Is it too slow? Could you tell us a little bit more about the size of your data set?


ruiye88 commented Sep 9, 2024

Hi Peter, thanks for the quick response. I tried a test run on a subset of my dataset (~1500 cells, maxiter1 = 100, maxiter2 = 50, maxiter3 = 50) and it took about 20-30 minutes. My full dataset has ~50K cells. Do you have an estimate of how long the full dataset might take? Also, I'm assuming most users will want to run multiple Kmax values and compare the results, so it would be really helpful if parallel computation could be implemented.


pcarbo commented Sep 9, 2024

@ruiye88 In the paper, we ran on a dataset containing ~35,000 cells, which is quite comparable to your dataset. Does your counts matrix have a high proportion of zeros and, if so, is it encoded as a sparse matrix? My understanding is that if your Y matrix has many rows and is sparse, gbcd will run faster. In particular, it uses the more efficient method that avoids computing the (dense) N x N covariance matrix whenever this condition is satisfied:

2 * ncol(Y) * mean(Y > 0) < nrow(Y)
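As a concrete check, you can evaluate this condition directly on your counts matrix before fitting. A minimal sketch using the Matrix package, where the dimensions and density below are made up purely for illustration:

```r
library(Matrix)

# Simulate a sparse counts matrix (rows = cells, columns = genes).
# The size and density here are hypothetical, chosen only to
# illustrate the check; substitute your own Y.
set.seed(1)
Y <- rsparsematrix(5000, 2000, density = 0.05,
                   rand.x = function(n) rpois(n, 2) + 1)

# gbcd takes the faster, covariance-free path when this holds:
use_fast_path <- 2 * ncol(Y) * mean(Y > 0) < nrow(Y)
print(use_fast_path)
```

If this prints FALSE for your data, converting Y to a sparse format (e.g., `as(Y, "CsparseMatrix")`) will not change the result, since the condition depends only on the dimensions and the fraction of nonzeros; but if it prints TRUE, storing Y as a sparse matrix lets gbcd exploit that structure.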

For us, the more efficient implementation ran on the dataset with ~35,000 cells in about 20 h.

You could also potentially fit multiple Kmax values in parallel (e.g., using mclapply), although that may use a lot of memory.
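This idea can be sketched with the parallel package. A hedged example, assuming the gbcd fitting function is `fit_gbcd` and that `Y` is your (preferably sparse) counts matrix; the Kmax values and core count are placeholders:

```r
library(parallel)

# Hypothetical Kmax values to compare; adjust to your analysis.
Kmax_values <- c(10, 20, 30)

# Fit one gbcd model per Kmax on separate cores. Each fit holds its
# own copy of intermediate results, so memory use scales with the
# number of simultaneous fits.
fits <- mclapply(Kmax_values,
                 function(k) fit_gbcd(Y, Kmax = k),
                 mc.cores = length(Kmax_values))
names(fits) <- paste0("Kmax", Kmax_values)
```

Note that mclapply relies on forking and so does not run in parallel on Windows; there, parLapply with a PSOCK cluster is the usual substitute.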

The current implementation also has some support for parallel computation if your R installation uses a BLAS library that supports multithreading, such as OpenBLAS or Intel MKL; that should speed things up a bit, although it is more important to make sure your data are encoded properly as a sparse matrix.

Hope this helps.
