How to run SEACells efficiently on large-scale dataset #70

Open
koh2ng0 opened this issue Jul 22, 2024 · 2 comments

Comments

@koh2ng0

koh2ng0 commented Jul 22, 2024

Hi,

First of all, thank you for developing this excellent package. I have tried running SEACells on our large-scale dataset (~270K cells). While it performed well, it was too slow, taking almost 3 days and 3 hours for model training over 50 iterations.

I tried two approaches: GPU and CPU.

  1. with GPU
    I attempted to run SEACells on the GPU with the following command:
import SEACells

model = SEACells.core.SEACells(adata,
                               build_kernel_on=build_kernel_on,
                               n_SEACells=n_SEACells,
                               n_waypoint_eigs=n_waypoint_eigs,
                               convergence_epsilon=1e-5,
                               use_gpu=True)

However, I encountered the following error:

"OutOfMemoryError: Out of memory allocating 6,121,777,152 bytes (allocated so far: 32,323,490,304 bytes)."
We have 3 GPUs, each with 32,768 MiB of memory. I assumed this would be sufficient, so I'm not sure why this error occurred.
(Screenshot of the error traceback, 2024-07-22 15:33.)
Could you advise on how to resolve this issue? Additionally, is it possible to utilize more than one GPU for this process? (A single-GPU memory-check sketch follows the CPU example below.)

  2. with CPU
    While this works, it takes an excessive amount of time.
model = SEACells.core.SEACells(adata,
                               build_kernel_on='X_scVI',
                               n_SEACells=n_SEACells,
                               n_waypoint_eigs=n_waypoint_eigs,
                               convergence_epsilon=1e-5,
                               use_sparse=True)
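
For reference, constructing the model is followed by the usual tutorial-style training calls; the sketch below is how that sequence looks in my runs, with the iteration cap matching the 50 iterations mentioned above (treat the exact values as assumptions to tune, not recommendations):

# Build the cell-cell kernel matrix, seed the archetypes, then run the iterative fit;
# with ~270K cells the fit loop is where essentially all of the runtime goes.
model.construct_kernel_matrix()
model.initialize_archetypes()
model.fit(min_iter=10, max_iter=50)  # max_iter matches the 50 iterations described above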

Could you recommend solutions to improve the time and memory efficiency for running SEACells on large-scale datasets?
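
On the GPU question above: the error text looks like a CuPy allocator message, and CuPy places allocations on a single current device by default, so having three GPUs does not add usable memory unless the code explicitly distributes work across them. Below is a minimal sketch, an assumption about the setup rather than anything SEACells prescribes, for pinning the process to one card and checking its free memory before training:

import os

# Expose a single physical GPU to this process *before* importing CuPy/SEACells,
# so every allocation lands on one card that is actually free (index 0 is just an example).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import cupy as cp

free_bytes, total_bytes = cp.cuda.runtime.memGetInfo()
print(f"Visible GPU: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")

# The failed request (~6.1 GB on top of ~32.3 GB already allocated) exceeds one
# 32 GiB (~34.4 GB) card on its own, so the working set has to shrink (fewer cells
# per run, or the sparse CPU kernel) rather than relying on the other idle GPUs.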

Thank you for your assistance.

@kjtreese

I'm hoping someone has some input on this, because I'm running into the same issue with a dataset of 240K cells. We're splitting it into smaller chunks, but it still takes up so much memory and time. We want metacells of smaller sizes to match (at least as closely as possible) the ones we have already made manually, so I'm setting the number of SEACells to 1000+, but it's just so slow.
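
In case comparing notes helps, the chunked runs described above can be sketched roughly as below, with adata being the full AnnData object from this thread; the "sample" obs column, the cells-per-metacell target, and n_waypoint_eigs=10 are placeholders for this illustration, not values prescribed by SEACells:

import SEACells

CELLS_PER_METACELL = 75  # placeholder target size; tune to match the manually curated metacells

models = {}
for sample in adata.obs["sample"].unique():          # "sample" is a placeholder grouping column
    ad_chunk = adata[adata.obs["sample"] == sample].copy()
    n_seacells = max(1, ad_chunk.n_obs // CELLS_PER_METACELL)
    model = SEACells.core.SEACells(ad_chunk,
                                   build_kernel_on='X_scVI',
                                   n_SEACells=n_seacells,
                                   n_waypoint_eigs=10,
                                   convergence_epsilon=1e-5,
                                   use_sparse=True)
    # ...then construct_kernel_matrix / initialize_archetypes / fit as in the sketch above.
    models[sample] = model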

@li-xuyang28

I saw that sparse-matrix support on GPU was planned/proposed at some point but never implemented. Is there any current plan for that to happen? I would certainly love to see this tool become scalable.
