Evaluation hangs with accelerate over multiple gpus. #4

Open
tyleryzhu opened this issue Mar 22, 2024 · 3 comments

Comments

@tyleryzhu

tyleryzhu commented Mar 22, 2024

Thank you for the incredible set of repositories (this one and prismatic-vlms), it has been a great joy using them. Very well-designed, configurable, and easy to use for researchers.

I'm running into a problem where evaluation hangs when run over multiple GPUs, precisely at the step where I load the local model checkpoint. This doesn't happen with just one GPU, however, and as far as I can tell it isn't simply that loading takes an abnormally long time.

Here is the command I'm using to evaluate my own trained SigLIP Prismatic VLM:

accelerate launch --num_processes=10 scripts/evaluate.py \
    --model_dir ../prismatic-vlms/runs/prism-siglip \
    --model_id prism-siglip \
    --dataset.type text-vqa-slim

which hangs on the line

| >> [*] Loading VLM prism-siglip-controlled+7b from Checkpoint; Freezing       load.py:98
Weights 🥶

This is being run on 10x RTX 3090s.

@siddk
Collaborator

siddk commented Mar 27, 2024

This is super weird; haven’t seen this before. One thing: are all 10 GPUs you’re running on located on a single node?

@show981111

Have you solved the issue? Mine hangs after loading from the checkpoint. I am using 2xV100.

@siddk
Collaborator

siddk commented Apr 30, 2024

@show981111 - can you tell me where exactly in the code you're noticing the hanging? Can you also dump RAM/GPU Memory Utilization?
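
One way to gather the information requested above is sketched below. It is only a rough diagnostic aid, not part of this repository: the helper names `arm_hang_watchdog`, `disarm_hang_watchdog`, and `memory_report` are made up for illustration, and reading the process rank from the `RANK` environment variable assumes the launcher exports it. The watchdog relies on Python's standard `faulthandler` module to print every thread's stack if the process is still running after a timeout, which should show where each rank is stuck. The memory report covers only the calling process, and the GPU numbers cover only memory managed by PyTorch's allocator, so `nvidia-smi` remains useful for the overall picture.

```python
import faulthandler
import os
import resource  # Unix-only standard-library module
import sys


def arm_hang_watchdog(timeout_s: float = 300.0, file=sys.stderr) -> None:
    """Dump all Python thread stacks to `file` every `timeout_s` seconds.

    If the process hangs, the repeated dumps show the line it is stuck on.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file, exit=False)


def disarm_hang_watchdog() -> None:
    """Cancel the pending stack dump once the suspect section has finished."""
    faulthandler.cancel_dump_traceback_later()


def memory_report() -> str:
    """Return a short host/GPU memory summary for the calling process."""
    # Assumption: the launcher sets RANK; fall back to "0" for single-process runs.
    rank = os.environ.get("RANK", "0")

    # ru_maxrss is the peak resident set size (kilobytes on Linux, bytes on macOS).
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    lines = [f"[rank {rank}] peak host RSS (ru_maxrss, platform units): {peak_rss}"]

    try:
        import torch
    except ImportError:
        lines.append(f"[rank {rank}] torch not importable; skipping GPU report")
        return "\n".join(lines)

    if not torch.cuda.is_available():
        lines.append(f"[rank {rank}] CUDA not available; skipping GPU report")
        return "\n".join(lines)

    for i in range(torch.cuda.device_count()):
        # Both calls report memory held by this process's PyTorch allocator only.
        allocated_mib = torch.cuda.memory_allocated(i) / 2**20
        reserved_mib = torch.cuda.memory_reserved(i) / 2**20
        lines.append(
            f"[rank {rank}] cuda:{i} allocated={allocated_mib:.1f} MiB "
            f"reserved={reserved_mib:.1f} MiB"
        )
    return "\n".join(lines)
```

A possible way to use it: call `arm_hang_watchdog(120)` near the top of the evaluation script, print `memory_report()` just before and just after the checkpoint load, and call `disarm_hang_watchdog()` once loading succeeds. With several processes, each rank writes its own stack dump to stderr, so comparing them shows whether all ranks stall at the same call or only some of them do.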
